Cooling Domains as First-Class Failure Boundaries in Storage Architecture

Authors

  • Mallikarjun Vppalapati, Sr. Cloud Systems Engineer, INFOR (US), LLC, USA

DOI:

https://doi.org/10.63282/3117-5481/AIJCST-V6I2P110

Keywords:

Cooling Domains, Storage Architecture, Failure Boundaries, Thermal-Aware Systems, Data Center Reliability, Fault Tolerance

Abstract

Modern large-scale storage systems are expected to deliver highly reliable service with minimal downtime even as their scale, density, and operational complexity grow. Reliability engineering has long centered on clearly defined failure domains such as disks, nodes, racks, and availability zones. Investigations of real-world outages, however, reveal hidden weaknesses that lie not only in these domains themselves but also in the abstractions used to define them; that is, the problem is less the domains than the way they are modeled. This paper proposes that storage system architecture and reliability analysis should treat cooling domains as first-class components. We define a cooling domain as a group of storage elements that share the same cooling facilities, or that are so thermally interdependent that they behave as a single unit during a cooling failure. We identify the weaknesses of traditional failure-domain paradigms and explain how they can miss system-wide risks arising from thermal events. Drawing on architectural analysis, operational telemetry, and a case study of a production-scale storage environment, we analyze how cooling-related failures propagate across hardware and software layers. The study identifies thermal dependencies, models correlated failures, and assesses availability with and without cooling-domain awareness. Our results indicate that ignoring cooling domains can lead to a substantial underestimation of the failure blast radius and recovery time, whereas incorporating them enables more accurate placement policies, redundancy strategies, and failure isolation mechanisms. The paper contributes definitions of cooling domains, practical approaches for identifying them, and guidelines for integrating them into storage architecture.
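
The blast-radius claim in the abstract can be illustrated with a minimal sketch. It is not the paper's method: the node names, the `cooling_domain` label, the toy placement, and the helper functions are all hypothetical, and an object is assumed to be unavailable only when every one of its replicas sits on a failed node. The sketch compares the objects lost when a single rack fails with those lost when a single cooling domain fails.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    name: str
    rack: str
    cooling_domain: str  # hypothetical label, e.g. the cooling unit serving the rack


def unavailable_objects(placement, failed_nodes):
    """Return the objects whose replicas all sit on failed nodes."""
    return sorted(
        obj for obj, replicas in placement.items()
        if all(node in failed_nodes for node in replicas)
    )


def blast_radius(nodes, placement, attr):
    """Map each value of `attr` (e.g. 'rack') to the objects lost if it fails."""
    groups = defaultdict(set)
    for node in nodes:
        groups[getattr(node, attr)].add(node)
    return {key: unavailable_objects(placement, members)
            for key, members in groups.items()}


# Toy topology (assumed): four racks, two cooling units.
nodes = [
    Node("n1", "rack-a", "crac-1"),
    Node("n2", "rack-b", "crac-1"),
    Node("n3", "rack-c", "crac-2"),
    Node("n4", "rack-d", "crac-2"),
]
n1, n2, n3, n4 = nodes

# Every object is rack-diverse, as a rack-aware placement policy would require.
placement = {
    "obj-1": [n1, n2],  # different racks, same cooling domain
    "obj-2": [n1, n3],  # different racks, different cooling domains
    "obj-3": [n3, n4],  # different racks, same cooling domain
}

by_rack = blast_radius(nodes, placement, "rack")
by_cooling = blast_radius(nodes, placement, "cooling_domain")

print("lost per rack failure:   ", by_rack)
print("lost per cooling failure:", by_cooling)
```

In this toy layout no single rack failure makes an object unavailable, yet the failure of either cooling unit takes out one object, so a rack-only model would report a blast radius of zero where the cooling-aware view does not.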

Published

2024-03-22

Section

Articles

How to Cite

[1] M. Vppalapati, “Cooling Domains as First-Class Failure Boundaries in Storage Architecture”, AIJCST, vol. 6, no. 2, pp. 96–106, Mar. 2024, doi: 10.63282/3117-5481/AIJCST-V6I2P110.
