Enhancing the Reliability of Cloud-Based Software Systems Using AI-Driven Fault Prediction and Auto-Remediation Techniques

M. Riyaz Mohammed

doi:10.63282/3117-5481/AIJCST-V3I5P101

Authors

M. Riyaz Mohammed, Department of Computer Science & IT, Jamal Mohamed College (Autonomous), Tiruchirapalli, Tamil Nadu, India.

DOI:

https://doi.org/10.63282/3117-5481/AIJCST-V3I5P101

Keywords:

Cloud Reliability, AIOps, Fault Prediction, Auto-Remediation, Safe Reinforcement Learning, Root-Cause Analysis, OpenTelemetry, Kubernetes, Service Mesh, Chaos Engineering, Digital Twin, Site Reliability Engineering (SRE), MLOps, Multivariate Time-Series Modeling, Knowledge Graphs, Canary Rollback, SLO/SLA and Error Budgets, Anomaly Detection, Causal Inference, Policy Engine

Abstract

Cloud-native systems operate at a scale and complexity where manual fault management cannot keep pace with dynamic workloads, ephemeral infrastructure, and intricate service dependencies. This paper presents an end-to-end AI-driven framework that couples proactive fault prediction with safe, policy-constrained auto-remediation to enhance the reliability of cloud-based software systems. Streaming telemetry (logs, metrics, traces, and change events) collected via OpenTelemetry is embedded using self-supervised representations for multivariate time series and fused with a service-dependency graph to support causal reasoning and fast root-cause localization. A hybrid predictor (temporal deep models with gradient-boosted residuals) yields early-warning scores and failure modes, while a safe reinforcement-learning policy executes guarded actions such as canary rollback, traffic shifting, pod restarts, autoscaling, circuit breaking, and configuration reversion. Guardrails combine SRE runbooks, SLO/error-budget constraints, and a digital-twin simulator validated by chaos experiments to prevent harmful interventions. Evaluated on Kubernetes microservices with fault injections (CPU throttling, memory leak, pod crash, network latency, and dependency outage), the approach achieved an early-warning AUC > 0.92 with a mean lead time of 3–7 minutes, reduced MTTR by 30–60%, lowered error-budget burn by 25–40%, and curtailed p95 tail latency during incidents without increasing steady-state cost. Ablations confirm the contributions of graph-aware features and safety constraints; interpretability is provided via SHAP at the signal level and causal subgraph explanations at the service level. The framework operationalizes AIOps for continuous reliability improvement and can be adopted incrementally within existing SRE workflows.
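
To make the guardrail idea in the abstract concrete, the following is a minimal sketch of how an early-warning score and an SLO burn rate might jointly gate remediation actions. It is an illustration under stated assumptions, not the paper's implementation: the action names are taken from the abstract, but the risk tiers, the `score_threshold` and `fast_burn` defaults, and the helper names `burn_rate` and `permitted_actions` are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    name: str
    risk: int  # hypothetical tier: 0 = low, 1 = medium, 2 = high


# Action names come from the abstract; the risk tiers are assumptions.
ACTIONS = [
    Action("autoscaling", 0),
    Action("circuit breaking", 0),
    Action("pod restart", 1),
    Action("traffic shifting", 1),
    Action("canary rollback", 2),
    Action("configuration reversion", 2),
]


def burn_rate(errors: int, requests: int, slo_target: float) -> float:
    """Observed error rate divided by the error rate the SLO permits."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must lie strictly between 0 and 1")
    if requests <= 0:
        return 0.0  # no traffic observed, so no budget is being consumed
    return (errors / requests) / (1.0 - slo_target)


def permitted_actions(score: float, rate: float,
                      score_threshold: float = 0.8,
                      fast_burn: float = 10.0) -> list[str]:
    """Return the action names this illustrative guardrail would allow."""
    if score < score_threshold:
        return []  # prediction too weak to justify any intervention
    if rate >= fast_burn:
        max_risk = 2  # budget burning fast: allow disruptive actions too
    elif rate >= 1.0:
        max_risk = 1  # burning faster than the SLO permits
    else:
        max_risk = 0  # within budget: only low-risk actions
    return [a.name for a in ACTIONS if a.risk <= max_risk]
```

For example, 5 errors in 1,000 requests against a 99.9% SLO gives a burn rate of about 5, so a call such as `permitted_actions(0.9, burn_rate(5, 1000, 0.999))` would allow the low- and medium-risk tiers while holding back rollback and configuration reversion. A learned policy like the one the abstract describes would choose among the permitted actions; this sketch covers only the constraint step.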

