AI-Driven Data Engineering Workflows for Dynamic ETL Optimization in Cloud-Native Data Analytics Ecosystems
DOI:
https://doi.org/10.63282/3117-5481/AIJCST-V7I3P108
Keywords:
Dynamic ETL, Reinforcement Learning, Learned Cost Models, Kubernetes, Spark, Flink, Dataflow, Lakehouse, Data Contracts, Anomaly Detection
Abstract
Cloud-native analytics requires ETL pipelines that continuously adapt to changing data volumes, evolving schemas, and tight cost-performance SLOs. This paper introduces an AI-based workflow that turns ETL into a self-optimizing, closed-loop system that does not rely on fixed heuristics. A metadata-first control plane models datasets, lineage, contracts, and policies; supervised models learn stage-level tactics (partitioning, join ordering, file sizing, compaction cadence) from execution traces; and a reinforcement-learning scheduler allocates resources and reorders DAGs based on live telemetry (throughput, p95 latency, spill rates, drift scores, and marginal cost). Predictive scaling right-sizes compute in Kubernetes/Spark/Flink/Dataflow, and a contract-aware anomaly detector gates bad inputs and triggers targeted corrective actions (imputation, schema evolution, selective backfills). Governance is enforced through policy-as-code with audit logs, row/column controls, and time-travel semantics in open table formats (Delta/Iceberg/Hudi). In mixed streaming-batch evaluations, the approach reduced data error rates, improved processing speed, cut manual maintenance time, and lowered effective cloud spend through spot-aware allocation and throttling of low-value work. The result is a resilient, transparent ETL fabric that aligns engineering actions with business SLOs and budgets, accelerating trustworthy BI and ML delivery in multi-tenant, cloud-native environments.
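As an illustration of the contract-aware gating idea mentioned in the abstract, the following is a minimal sketch and not the paper's implementation. It assumes a hypothetical `Contract` type listing required columns, and it swaps in a simple z-score on batch row counts as a stand-in for the anomaly detector; the names `Contract`, `gate_batch`, and `max_volume_zscore` are invented for this example.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass(frozen=True)
class Contract:
    """Hypothetical data contract: required columns plus a volume tolerance."""
    required_columns: frozenset
    max_volume_zscore: float = 3.0  # assumed threshold, not from the paper


def gate_batch(columns, row_count, history, contract):
    """Return (accepted, reasons) for one incoming batch.

    columns   -- column names present in the batch
    row_count -- number of rows in the batch
    history   -- row counts of recently accepted batches
    """
    reasons = []

    # Contract check: every required column must be present.
    missing = contract.required_columns - set(columns)
    if missing:
        reasons.append("missing columns: " + ", ".join(sorted(missing)))

    # Anomaly check: flag row counts far from the recent mean.
    # Skipped when there is too little history or no variance.
    if len(history) >= 2:
        mu = mean(history)
        sigma = pstdev(history)
        if sigma > 0:
            z = abs(row_count - mu) / sigma
            if z > contract.max_volume_zscore:
                reasons.append(f"row count z-score {z:.1f} exceeds limit")

    return (not reasons, reasons)
```

A caller would pass each incoming batch through `gate_batch` before loading it and route rejected batches, together with the returned reasons, to whatever corrective step applies (for example a backfill or a schema-evolution review).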
