Apache Spark with Java: Architecture, Performance, and Use Cases

Authors

  • Sammy Brandon, Obafemi Awolowo University, Ile-Ife. Author

DOI:

https://doi.org/10.63282/3117-5481/AIJCST-V6I6P104

Keywords:

Apache Spark, Java Programming, Big Data Processing, Distributed Computing, In-Memory Computing, Spark Architecture, Performance Evaluation, Spark SQL, Batch Processing, Iterative Workloads, Scalability, Resource Optimization, Enterprise Data Analytics

Abstract

Apache Spark has emerged as a widely adopted framework for large-scale data processing due to its in-memory computation model and scalability. This study investigates the architecture, performance characteristics, and practical applicability of Apache Spark applications implemented using the Java programming language. The primary purpose of this research is to evaluate how different workload types perform under varying data sizes and to assess the suitability of Java for enterprise-scale Spark deployments.

The methodology involved executing batch processing, SQL-based analytics, and iterative workloads on a distributed Spark cluster while measuring execution time, memory utilization, and performance variability. Each experiment was run multiple times to ensure consistency, and statistical analysis was applied to assess performance stability across runs. The results indicate that execution time increases proportionally with dataset size across all workloads. SQL-based workloads show lower execution times than batch and iterative workloads, while iterative workloads consume more memory. Performance variability across repeated executions remains low, indicating stable and predictable system behavior.

In conclusion, the findings confirm that Apache Spark with Java provides scalable and reliable performance for large-scale data processing tasks. The study highlights the effectiveness of Spark SQL for analytical workloads and underscores the importance of workload-aware resource tuning, particularly for computation-intensive applications. These results support the continued use of Java-based Spark solutions in enterprise environments.
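The repeated-run measurement described in the abstract can be illustrated with a minimal Java sketch. The abstract does not name the statistic used to assess stability, so the sample standard deviation and coefficient of variation below are assumptions, as are the class name `RepeatedRunStats`, its method names, and the placeholder workload. In an actual experiment the `Runnable` would trigger a Spark action instead.

```java
// Hypothetical helper for illustration; not taken from the study itself.
final class RepeatedRunStats {

    /** Runs the workload the given number of times; returns wall-clock seconds per run. */
    static double[] timeRuns(Runnable workload, int runs) {
        double[] seconds = new double[runs];
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            workload.run();
            seconds[i] = (System.nanoTime() - start) / 1e9;
        }
        return seconds;
    }

    /** Arithmetic mean of the samples. */
    static double mean(double[] xs) {
        double sum = 0.0;
        for (double x : xs) {
            sum += x;
        }
        return sum / xs.length;
    }

    /** Sample standard deviation (divides by n - 1); needs at least two samples. */
    static double sampleStdDev(double[] xs) {
        if (xs.length < 2) {
            throw new IllegalArgumentException("need at least two samples");
        }
        double m = mean(xs);
        double squaredDeviations = 0.0;
        for (double x : xs) {
            squaredDeviations += (x - m) * (x - m);
        }
        return Math.sqrt(squaredDeviations / (xs.length - 1));
    }

    /** Standard deviation relative to the mean; lower values mean more stable runs. */
    static double coefficientOfVariation(double[] xs) {
        return sampleStdDev(xs) / mean(xs);
    }

    public static void main(String[] args) {
        // Placeholder CPU-bound workload standing in for a Spark job (assumption).
        Runnable workload = () -> {
            long sum = 0;
            for (int i = 0; i < 5_000_000; i++) {
                sum += i % 7;
            }
            if (sum < 0) {
                throw new IllegalStateException("unreachable; keeps the loop from being optimized away");
            }
        };
        double[] times = timeRuns(workload, 5);
        System.out.printf("mean=%.4fs sd=%.4fs cv=%.2f%%%n",
                mean(times), sampleStdDev(times), 100.0 * coefficientOfVariation(times));
    }
}
```

A low coefficient of variation across runs would correspond to the "stable and predictable" behavior the abstract reports. Wall-clock timing of this kind also includes JVM warm-up effects, so discarding the first run or two is a common precaution.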

Published

2024-11-12

Section

Articles

How to Cite

[1]
S. Brandon, “Apache Spark with Java: Architecture, Performance, and Use Cases”, AIJCST, vol. 6, no. 6, pp. 35–42, Nov. 2024, doi: 10.63282/3117-5481/AIJCST-V6I6P104.
