Multi-Agent Systems for Autonomous Orchestration in AI-Driven Computing Networks

Chloe Bennett; Adichie Okafor

doi:10.63282/3117-5481/AIJCST-V3I4P102

Authors

Dr. Chloe Bennett, School of Science and Technology, University of Abuja, Abuja, Nigeria.
Adichie Okafor, School of Science and Technology, University of Abuja, Abuja, Nigeria.


Keywords:

Multi-Agent Systems, Autonomous Orchestration, Edge–Cloud Computing, Multi-Agent Reinforcement Learning, Distributed Optimization, Intent-Based Networking, SLA-Aware Scheduling, Digital Twins, Self-Healing, Federated Coordination, Zero-Trust Security, Explainability

Abstract

AI-driven computing networks spanning edge, fog, and cloud demand real-time coordination under volatile workloads, heterogeneous resources, and strict service-level objectives. This paper proposes a multi-agent systems (MAS) architecture for autonomous orchestration that couples decentralized decision-making with global policy compliance. Specialized agents (scheduler, scaler, placement, data, and security sentinels) negotiate via market-based mechanisms and cooperative game-theoretic protocols to allocate compute, memory, and bandwidth while respecting latency budgets and energy caps. Learning-enabled controllers combine model-predictive scheduling with multi-agent reinforcement learning to adapt to demand surges, drift, and failures; safety layers constrain exploration through formal guards and intent-based policies. To improve robustness, agents share compact state via a publish–subscribe control plane and use digital-twin simulations for counterfactual rollouts before enacting changes in production. The design supports privacy-preserving analytics through federated coordination at the edge and employs trust scoring and zero-trust enforcement to mitigate adversarial behavior and misconfigurations. We present a reference implementation with pluggable observability hooks and outline evaluation metrics for tail latency, SLA adherence, energy per inference, recovery time, and orchestration overhead. Results demonstrate that MAS-based orchestration can reduce p95 latency and policy-violation rates while improving resource utilization and fault tolerance, suggesting a practical path to self-optimizing, self-healing AI infrastructure across heterogeneous, multi-tenant environments.
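As a rough illustration of the market-based allocation described in the abstract, the following minimal sketch shows a sealed-bid award in which only bids that meet a latency budget and an energy cap are eligible. It is not the paper's reference implementation: the names Bid and award, the field names, and all numeric values are hypothetical and chosen only for this example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Bid:
    """A placement agent's offer to host a task (all fields are illustrative)."""
    agent: str
    price: float        # cost the agent asks for hosting the task
    latency_ms: float   # latency the agent expects to deliver
    energy_j: float     # energy the agent expects to spend per inference


def award(bids: List[Bid], latency_budget_ms: float, energy_cap_j: float) -> Optional[Bid]:
    """Return the cheapest bid that satisfies both constraints, or None."""
    feasible = [
        b for b in bids
        if b.latency_ms <= latency_budget_ms and b.energy_j <= energy_cap_j
    ]
    # Ties on price are broken by the lower expected latency.
    return min(feasible, key=lambda b: (b.price, b.latency_ms), default=None)


# Hypothetical bids from three tiers; the numbers are made up for the example.
bids = [
    Bid("edge-1", price=3.0, latency_ms=12.0, energy_j=0.8),
    Bid("fog-1", price=2.0, latency_ms=35.0, energy_j=0.5),
    Bid("cloud-1", price=1.0, latency_ms=90.0, energy_j=0.4),
]
winner = award(bids, latency_budget_ms=40.0, energy_cap_j=1.0)
print(winner.agent if winner else "no feasible bid")  # prints "fog-1"
```

Filtering before selecting treats the latency budget and energy cap as hard constraints rather than terms in the price, which is one simple way to keep a negotiation outcome inside policy; a real system would presumably add the learning, safety, and trust layers the abstract mentions.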

