Productionizing GPU Inference on EKS with KServe and NVIDIA Triton

Authors

  • Babulal Shaik, Cloud Solutions Architect at Amazon Web Services, USA

DOI:

https://doi.org/10.63282/3117-5481/AIJCST-V7I6P104

Keywords:

EKS, GPU Inference, KServe, NVIDIA Triton, Kubernetes, MLOps, Model Serving, Autoscaling, Deep Learning, A100 GPU, Performance Optimization, Model Deployment

Abstract

The growing adoption of AI-driven applications has created a need to operationalize GPU inference at scale in production environments. Yet deploying and managing GPU-accelerated machine learning models in practice remains challenging because of the complexity of infrastructure orchestration, cost management, and model lifecycle automation. This paper explores a complete framework for productionizing GPU inference on Amazon Elastic Kubernetes Service (EKS) with KServe and the NVIDIA Triton Inference Server, providing a streamlined path from model deployment to large-scale inference. Amazon EKS supplies a managed Kubernetes foundation that handles automatic scaling, resilience, and security, while KServe simplifies model serving through standardized APIs and native autoscaling for inference workloads. NVIDIA Triton completes the stack with high-performance, multi-framework serving built on GPU optimization, dynamic batching, and model ensembles, all of which are essential for maximizing hardware utilization. Together these components form an end-to-end pipeline that is compatible with modern MLOps practices, supporting continuous integration, model versioning, and automated rollouts. The paper also discusses techniques for balancing performance and cost, including GPU sharing, autoscaling policies, and efficient pod scheduling. With the EKS–KServe–Triton combination, organizations can turn experimental ML models into production-grade, scalable inference services, closing the gap between model development and real-world deployment with a cloud-native, cost-optimized approach.
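To make the serving path concrete, the sketch below shows how a client application might invoke a Triton-hosted model exposed through a KServe InferenceService, using the Open Inference Protocol (KServe v2) REST endpoint that both KServe and Triton implement. This is a minimal illustration rather than the paper's deployment: the service URL, model name, and tensor names are hypothetical placeholders.

```python
# Minimal client sketch, assuming a KServe InferenceService backed by Triton is
# already deployed on EKS and reachable at the hypothetical URL below.
# The Open Inference Protocol (v2) REST API is used: POST /v2/models/<name>/infer.

import requests

BASE_URL = "http://resnet50.default.example.com"  # hypothetical InferenceService URL
MODEL_NAME = "resnet50"                            # hypothetical model in Triton's model repository


def infer(images):
    """Send a batch of flattened FP32 image tensors (3*224*224 floats each)."""
    flat = [value for image in images for value in image]  # row-major, batch-flattened
    payload = {
        "inputs": [
            {
                "name": "input__0",                 # assumed input name from the model's config.pbtxt
                "shape": [len(images), 3, 224, 224],
                "datatype": "FP32",
                "data": flat,                       # v2 protocol expects row-major element data
            }
        ]
    }
    response = requests.post(
        f"{BASE_URL}/v2/models/{MODEL_NAME}/infer",
        json=payload,
        timeout=30,
    )
    response.raise_for_status()
    return response.json()["outputs"]


if __name__ == "__main__":
    dummy_image = [0.0] * (3 * 224 * 224)  # all-zero image, just to exercise the endpoint
    outputs = infer([dummy_image])
    print(outputs[0]["name"], outputs[0]["shape"])
```

Because Triton's dynamic batching operates server-side, individual clients like this one can send small requests while the server coalesces them into larger GPU batches.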


Published

2025-11-15

Section

Articles

How to Cite

[1] B. Shaik, “Productionizing GPU Inference on EKS with KServe and NVIDIA Triton”, AIJCST, vol. 7, no. 6, pp. 37–45, Nov. 2025, doi: 10.63282/3117-5481/AIJCST-V7I6P104.
