AI-Based Modeling of System Reliability and Performance Metrics in Heterogeneous Computing Architectures

Authors

  • Dr. James William, School of Computer Science and Software Engineering, University of Melbourne, Australia

DOI:

https://doi.org/10.63282/3117-5481/AIJCST-V6I5P101

Keywords:

Heterogeneous Computing, System Reliability, AI Modeling, Performance Prediction, Deep Learning, Reinforcement Learning, Reliability Metrics, HPC, Fault Tolerance, Predictive Maintenance

Abstract

The rapid proliferation of heterogeneous computing architectures (HCAs), which combine CPUs, GPUs, TPUs, FPGAs, and emerging neuromorphic processors, has transformed high-performance computing (HPC) and artificial intelligence (AI) workloads. As architectural complexity grows, so does the uncertainty in assessing reliability and performance, particularly when systems integrate diverse processing elements, interconnects, and memory hierarchies. Traditional reliability models struggle to capture the nonlinear interdependencies between components in modern, non-deterministic heterogeneous environments. This paper proposes an AI-based modeling framework for estimating system reliability and performance metrics in HCAs. The proposed approach combines Deep Neural Networks (DNNs), Bayesian inference models, and Reinforcement Learning (RL) strategies to forecast system-level reliability trends, performance bottlenecks, and mean time between failures (MTBF). To train AI models that capture the multi-domain interactions of computation, communication, and thermal dynamics in large-scale systems, a large-scale simulation dataset was created from synthetic benchmarks, real workloads (SPEC ACCEL, MLPerf, HPCG), rare fault profiles, and injected fault profiles. The approach applies feature engineering to runtime telemetry (e.g., power, temperature, utilization) and hardware counters (e.g., instruction-level parallelism, cache miss rates) to construct predictive and adaptive reliability estimators. Unlike traditional models, the AI-based system updates its internal state dynamically through online learning, allowing it to identify faults and self-optimize.
Results indicate that the proposed AI system predicts system performance degradation and reliability with accuracies of 96.8% and 94.2%, respectively, in heterogeneous environments. Comparisons with classical reliability models, namely Markov and Weibull-based models, show substantial gains in adaptability, precision, and generalization. Moreover, reinforcement learning-based optimization yielded a 15 to 25 percent improvement in task scheduling performance under constrained thermal and power budgets. This article contributes to the growing intersection of AI-based system modeling and heterogeneous computing optimization, and lays the groundwork for intelligent reliability management in data centers, autonomous computing systems, and HPC infrastructure.
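
For readers unfamiliar with the classical baseline named in the abstract, the following is a minimal Python sketch of a Weibull-based reliability model. It is not the paper's implementation: the function names, the series-system assumption (the system fails as soon as any component fails), and the example parameter values are all hypothetical and chosen only for illustration. It uses the standard two-parameter Weibull forms R(t) = exp(-(t/scale)^shape) and MTBF = scale * Gamma(1 + 1/shape).

```python
import math


def weibull_reliability(t, shape, scale):
    """Probability that a component survives to time t (two-parameter Weibull)."""
    if t < 0:
        raise ValueError("t must be non-negative")
    return math.exp(-((t / scale) ** shape))


def weibull_mtbf(shape, scale):
    """Mean time between failures for a two-parameter Weibull lifetime."""
    return scale * math.gamma(1.0 + 1.0 / shape)


def series_reliability(t, components):
    """Reliability of a series system with independent components.

    `components` is a list of (shape, scale) pairs. Independence and the
    series structure are simplifying assumptions made for this sketch.
    """
    r = 1.0
    for shape, scale in components:
        r *= weibull_reliability(t, shape, scale)
    return r


if __name__ == "__main__":
    # Hypothetical parameters (hours); not taken from the paper.
    node = [(1.2, 50_000.0), (1.5, 30_000.0), (1.0, 80_000.0)]
    print("R(10,000 h) =", series_reliability(10_000.0, node))
    print("Component MTBFs (h):", [weibull_mtbf(k, s) for k, s in node])
```

A static model of this kind has fixed parameters once fitted, which is the limitation the abstract contrasts with estimators that update online from telemetry.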

Published

2024-09-05

How to Cite

[1] J. William, “AI-Based Modeling of System Reliability and Performance Metrics in Heterogeneous Computing Architectures”, AIJCST, vol. 6, no. 5, pp. 1–13, Sep. 2024, doi: 10.63282/3117-5481/AIJCST-V6I5P101.
