AI-Driven Data Engineering Workflows for Dynamic ETL Optimization in Cloud-Native Data Analytics Ecosystems

Authors

  • Dinesh Babu Govindarajulunaidu Sambath Narayanan, Independent Researcher, USA

DOI:

https://doi.org/10.63282/3117-5481/AIJCST-V7I3P108

Keywords:

Dynamic ETL, Reinforcement Learning, Learned Cost Models, Kubernetes, Spark, Flink, Dataflow, Lakehouse, Data Contracts, Anomaly Detection

Abstract

Cloud-native analytics requires ETL pipelines that continuously adapt to changing data volumes, evolving schemas, and tight cost-performance SLOs. This paper introduces an AI-based workflow that turns ETL into a self-optimizing, closed-loop system rather than one driven by static heuristics. A metadata-first control plane models datasets, lineage, contracts, and policies; supervised models learn stage-level tactics (partitioning, join ordering, file sizing, compaction cadence) from execution traces; and a reinforcement-learning scheduler allocates resources and reorders DAGs based on live telemetry (throughput, p95 latency, spill rates, drift scores, and marginal cost). Predictive scaling right-sizes compute in Kubernetes/Spark/Flink/Dataflow, and contract-aware anomaly detectors gate bad inputs and trigger targeted corrective actions (imputation, schema evolution, selective backfills). Governance is enforced through policy-as-code with audit logs, row/column-level controls, and time-travel semantics in open table formats (Delta/Iceberg/Hudi). In mixed streaming-batch evaluations, the approach reduced data error rates, improved processing speed, cut manual maintenance time, and lowered effective cloud spend through spot-aware allocation and throttling of low-value work. The result is a resilient, transparent ETL fabric that aligns engineering actions with business SLOs and budgets, accelerating trustworthy BI and ML delivery in multi-tenant, cloud-native environments.
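
As an illustration only of the contract-aware gating described in the abstract, a minimal Python sketch follows. It is not the paper's implementation: the `Contract` fields, the null-rate threshold, and the action names (`pass`, `impute`, `evolve_schema`, `quarantine`) are hypothetical and chosen only to show how a gate might map a batch to one corrective action.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class Contract:
    """Hypothetical data contract: required columns, their types, a null-rate limit."""
    columns: dict[str, type]
    max_null_rate: float = 0.05


def gate(batch: list[dict[str, Any]], contract: Contract) -> str:
    """Map a batch of rows to one action: pass, impute, evolve_schema, or quarantine."""
    if not batch:
        return "pass"

    seen = set().union(*(row.keys() for row in batch))

    # A required column that never appears cannot be repaired locally.
    if not set(contract.columns) <= seen:
        return "quarantine"

    # A non-null value of the wrong type also blocks the batch.
    for row in batch:
        for name, expected in contract.columns.items():
            value = row.get(name)
            if value is not None and not isinstance(value, expected):
                return "quarantine"

    # Columns outside the contract are treated as a schema-evolution candidate.
    if seen - set(contract.columns):
        return "evolve_schema"

    # Too many nulls in any contracted column triggers imputation.
    for name in contract.columns:
        nulls = sum(1 for row in batch if row.get(name) is None)
        if nulls / len(batch) > contract.max_null_rate:
            return "impute"

    return "pass"
```

In this sketch the checks are ordered from least to most recoverable, so a batch that violates types is quarantined before any cheaper repair is considered. A real gate would likely combine such rule checks with learned drift or anomaly scores.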

Published

2025-05-23

Section

Articles

How to Cite

[1]
D. B. G. Sambath Narayanan, “AI-Driven Data Engineering Workflows for Dynamic ETL Optimization in Cloud-Native Data Analytics Ecosystems”, AIJCST, vol. 7, no. 3, pp. 99–109, May 2025, doi: 10.63282/3117-5481/AIJCST-V7I3P108.
