Research Archive

Journal Article · Subscription · Cloud Computing

Infrastructure as Code: Principles, Patterns, and Pitfalls in Cloud-Native DevOps Environments

Infrastructure as Code (IaC) represents a paradigm shift in how cloud infrastructure is provisioned, managed, and evolved, yet its practical adoption is accompanied by a set of under-documented failure patterns. This paper conducts a systematic literature review of 94 peer-reviewed publications combined with a practitioner survey of 340 DevOps engineers across North America and Europe. We categorize IaC tools into three architectural families — declarative, imperative, and hybrid — and evaluate them against six quality dimensions: idempotency, modularity, testability, auditability, portability, and drift detection. Our survey reveals that 67% of teams encounter configuration drift within six months of initial deployment, and that fewer than 30% implement automated compliance checks on their IaC manifests. We introduce the concept of "infrastructure entropy" to describe the gradual degradation of alignment between declared and actual infrastructure state, and propose a set of 14 engineering practices — collectively termed the IaC Hygiene Framework — to mitigate it. Case evidence from three organizations using Terraform, Ansible, and Pulumi respectively is used to validate the framework. This research provides both theoretical grounding and practical tooling guidance for organizations pursuing robust cloud infrastructure automation.

Dmitri Volkov, Samantha Osei-Bonsu, Jiaying Wu, Carlos Mendez-Rios · Aug 2016 · 278 citations
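
The drift the abstract describes, a gap between declared and actual infrastructure state, can be illustrated with a small comparison routine. This is a minimal sketch under stated assumptions, not the paper's method or any tool's API: the function name `detect_drift`, the three report categories, and the dictionary layout of resources are all hypothetical choices made for illustration.

```python
from typing import Any

# Hypothetical layout: resource name -> attribute dictionary.
State = dict[str, dict[str, Any]]


def detect_drift(declared: State, actual: State) -> dict[str, list[str]]:
    """Compare declared state with observed state and list the differences.

    Illustrative only; the categories below are assumptions, not the paper's.
    """
    report: dict[str, list[str]] = {"missing": [], "unmanaged": [], "changed": []}

    for name, attrs in declared.items():
        if name not in actual:
            # Declared in the manifest but absent from the environment.
            report["missing"].append(name)
        elif actual[name] != attrs:
            # Present, but at least one attribute differs from the manifest.
            report["changed"].append(name)

    for name in actual:
        if name not in declared:
            # Exists in the environment but is not described by any manifest.
            report["unmanaged"].append(name)

    return report


if __name__ == "__main__":
    declared = {"web": {"size": "small"}, "db": {"size": "large"}}
    actual = {"web": {"size": "medium"}, "cache": {"size": "small"}}
    print(detect_drift(declared, actual))
```

Running a check of this kind on a schedule, rather than only at deploy time, is one way a team might notice drift before it accumulates.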

Journal Article · Open Access · Artificial Intelligence

Deep Reinforcement Learning for Adaptive Resource Allocation in Multi-Tenant Cloud Data Centers: Architecture, Training Regimes, and Production Evaluation

Resource allocation in multi-tenant cloud data centers has traditionally been governed by hand-crafted heuristics that fail to adapt to the non-stationary workload distributions characteristic of production environments. This paper presents DeepAlloc, a deep reinforcement learning framework for adaptive resource allocation that replaces static scheduling policies with neural policy networks trained using Proximal Policy Optimization (PPO) across simulated and real cluster environments. DeepAlloc models the allocation problem as a Markov Decision Process in which the state space encodes current cluster utilization, pending job queue characteristics, and tenant SLA parameters, while the action space encompasses CPU core assignment, memory quota setting, and network bandwidth reservation decisions. We evaluate DeepAlloc against four baseline schedulers (FIFO, Shortest Job First, Dominant Resource Fairness (DRF), and the Kubernetes default scheduler) using both a simulator driven by 14 months of production trace data from a large cloud provider and a 200-node physical testbed. DeepAlloc achieves 23% higher cluster utilization, a 41% reduction in SLA violation rate, and 18% lower mean job completion time compared to the best-performing baseline. We characterize the training stability challenges specific to cluster scheduling environments and introduce curriculum learning and action masking techniques that reduce policy collapse incidence by 87%. This work demonstrates the practical viability of deep RL as a production scheduling substrate.

Chukwuebuka Obi, Maja Lindgren, Ryusei Tanaka, Fatima Benali · Aug 2016 · 388 citations
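
Action masking, which the abstract names as one of its stabilization techniques, is commonly implemented by zeroing out the probability of infeasible actions before sampling. The abstract does not say how DeepAlloc builds or applies its masks, so the sketch below is a generic formulation under stated assumptions: the helper names `feasible_mask` and `masked_softmax`, and the idea that an action is "request N CPU cores", are hypothetical.

```python
import math


def feasible_mask(free_cores: int, core_options: list[int]) -> list[bool]:
    """Mark each hypothetical 'allocate N cores' action as feasible or not."""
    return [cores <= free_cores for cores in core_options]


def masked_softmax(logits: list[float], mask: list[bool]) -> list[float]:
    """Convert policy logits to probabilities, assigning 0 to masked-out actions."""
    if len(logits) != len(mask):
        raise ValueError("logits and mask must have the same length")
    if not any(mask):
        raise ValueError("at least one action must be feasible")

    # Subtract the largest feasible logit for numerical stability.
    peak = max(logit for logit, ok in zip(logits, mask) if ok)
    weights = [math.exp(logit - peak) if ok else 0.0 for logit, ok in zip(logits, mask)]
    total = sum(weights)
    return [weight / total for weight in weights]


if __name__ == "__main__":
    core_options = [1, 2, 4, 8]          # assumed discrete action set
    logits = [0.0, 0.0, 5.0, 5.0]        # the raw policy prefers large allocations
    mask = feasible_mask(free_cores=3, core_options=core_options)
    print(masked_softmax(logits, mask))  # infeasible actions receive probability 0
```

Because infeasible actions can never be sampled, the policy does not spend training episodes learning to avoid allocations the cluster could not satisfy anyway.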
