Research Archive

Journal Article Open Access Artificial Intelligence

DevOps in the Era of Generative AI: Reimagining Pipeline Automation, Incident Response, and Knowledge Management with Foundation Models

The rapid proliferation of generative AI foundation models — including large language models, code generation systems, and multimodal agents — is poised to fundamentally reshape DevOps practice across the software delivery lifecycle. This paper presents a horizon study examining how generative AI is being integrated into DevOps toolchains in 2024–2025, synthesizing evidence from 23 practitioner organizations, 41 tool vendor disclosures, and a systematic review of 67 preprint and published studies. We identify six primary integration points at which generative AI is creating substantive capability shifts: automated pipeline configuration generation, natural language infrastructure querying, incident narrative summarization, postmortem synthesis, documentation generation from codebase context, and intelligent on-call assistant agents. Through structured case analysis, we find that organizations deploying GenAI-augmented incident response workflows reduce average time-to-mitigation by 34% and decrease escalation rates by 28%. We introduce the GenAI-DevOps Integration Maturity Model (GDIMM), which characterizes organizational readiness across five levels from ad-hoc LLM use to fully agentic delivery pipelines. We also surface emerging risks — including prompt injection in CI/CD contexts, over-reliance on LLM-generated runbooks, and governance gaps in AI-generated infrastructure code — and propose mitigation design patterns. This paper provides the most comprehensive empirical and conceptual treatment of the GenAI-DevOps intersection to date.

Chukwuemeka Obialo, Nina Brandt, Yuki Hashimoto, Rania Saleh · Jan 2025 · 274 citations

Journal Article Open Access Augmented and Virtual Reality

Spatial Computing Interfaces for Collaborative Engineering Design: Task Performance, Cognitive Load, and Design Outcome Quality in Mixed Reality Versus Traditional CAD Workflows

Spatial computing platforms -- particularly mixed reality (MR) headsets such as the Apple Vision Pro and Microsoft HoloLens 2 -- have been proposed as transformative tools for engineering design collaboration, enabling co-located and remote teams to interact with three-dimensional CAD models in their physical workspace context. Despite significant industry investment, rigorous empirical evidence comparing MR-assisted design workflows to established 2D screen-based CAD collaboration on task performance, cognitive load, and design outcome quality remains scarce. This paper presents a mixed-methods study combining a within-subjects randomized controlled experiment (n=84 engineers across mechanical, architectural, and product design specialties) with qualitative protocol analysis to evaluate MR versus screen-based CAD collaboration across three design task categories: design review and defect identification, assembly sequence planning, and cross-functional stakeholder communication. MR collaboration reduces defect identification time by 34 percent and increases defect detection rate by 22 percent for complex assembly structures compared to screen-based review, driven by improved spatial reference frame sharing. For assembly sequence planning, MR achieves equivalent outcome quality with 18 percent lower task completion time. However, MR imposes significantly higher cognitive load (NASA-TLX increase of 31 percent) during text-heavy annotation tasks due to current headset text input limitations. We develop the Spatial Computing Task Suitability Matrix (SCSM) mapping engineering task types to expected MR benefit, and provide evidence-grounded adoption guidance for engineering organizations evaluating spatial computing investments.

Adaeze Nwachukwu, Erik Svensson, Akira Mori, Diana Rodrigues · Jan 2025 · 112 citations

Journal Article Open Access Software Engineering

Developer Experience as a Strategic Capability: Measuring, Managing, and Improving DX in High-Performance DevOps Organizations

Developer experience (DX) has emerged as a strategic organizational capability, with growing evidence linking DX quality to software delivery performance, engineer retention, and innovation output. Yet DX measurement frameworks in current use are predominantly anecdotal or survey-based without validated psychometric properties, limiting their utility as management instruments. This paper presents the Developer Experience Engineering Framework (DXEF), which operationalizes DX across five validated dimensions — Feedback Loop Speed, Cognitive Load, Flow State Enablement, Toolchain Friction, and Psychological Safety — and validates the framework through a confirmatory factor analysis study across 1,240 engineers in 34 organizations. DXEF scores exhibit strong criterion validity against DORA four-key metrics (ρ=0.71) and moderate predictive validity against 12-month engineer retention rates (ρ=0.54). Using DXEF longitudinal data, we identify the five highest-leverage DX interventions — CI pipeline speed improvement, local development environment standardization, on-call burden reduction, documentation quality uplift, and deployment process simplification — and quantify their mean impact on delivery performance. Organizations that improve DXEF scores by one standard deviation exhibit deployment frequency improvements of 31% and change failure rate reductions of 24%. We provide the DXEF instrument as an open-access resource for practitioners.

Adaobi Chukwuemeka, Erik Magnusson, Akiko Hayashi, Carla Mendes · Nov 2024 · 167 citations

Journal Article Open Access Software Engineering

Green DevOps: Measuring and Reducing the Carbon Footprint of Software Delivery Pipelines in Cloud-Native Environments

The environmental impact of software systems, including the energy consumed by cloud infrastructure, CI/CD pipeline execution, and continuous testing, has become a growing concern for organizations committed to sustainability goals. This paper introduces Green DevOps, a framework for measuring, attributing, and systematically reducing the carbon footprint of software delivery pipelines. We present a methodology for estimating pipeline-level carbon emissions using cloud provider energy intensity data and workload utilization metrics, validated against direct power measurements in a private cloud environment. Applying our methodology across 14 organizations, we find that CI/CD pipeline execution accounts for an average of 23% of an engineering organization's total cloud carbon budget, a figure largely unrecognized by sustainability reporting frameworks. We identify five high-impact carbon reduction strategies: test suite optimization (average 31% reduction in test carbon), pipeline parallelization efficiency tuning, off-peak scheduling of non-latency-sensitive jobs, spot/preemptible instance adoption for CI workers, and container image minimization. We also propose a CI/CD Carbon Efficiency Score (CES) and demonstrate integration with GitHub Actions and GitLab CI through an open-source emissions monitoring plugin. This work establishes the empirical and methodological foundation for sustainable DevOps practice and provides immediately actionable guidance for practitioners.

Oluwaseun Badejo, Astrid Nilsson, Takuya Nakashima, Giulia Martinelli · Aug 2024 · 198 citations

Journal Article Open Access Distributed Systems

Consensus-Free Distributed Transactions: Evaluating CRDT-Based Eventual Consistency, Saga Patterns, and Deterministic Simulation Testing in Geo-Distributed Microservices at Scale

Global-scale distributed systems cannot simultaneously provide strong consistency, high availability, and partition tolerance -- the CAP theorem constraint that has motivated the design of eventually consistent distributed systems and the development of Conflict-Free Replicated Data Types (CRDTs) as a principled approach to safe concurrent updates. Yet the engineering of correct applications on eventually consistent foundations requires careful design of data structures, transaction boundaries, and conflict resolution strategies that remain poorly understood in practice. This paper presents a systematic engineering study of consistency model selection and implementation in geo-distributed microservices, combining theoretical analysis with a controlled experimental platform and a large-scale production case study. We evaluate three architectural patterns -- CRDT-based state replication, choreography-based Saga transactions, and orchestrated Saga with compensating transactions -- for their correctness guarantees, throughput characteristics, and failure recovery behavior across five WAN topology configurations representing realistic global deployment scenarios. CRDT-based replication achieves 99.97 percent convergence within 500 milliseconds under normal network conditions but requires application-level conflict semantics that are incompatible with 23 percent of evaluated business logic patterns. Saga-based transactions provide clearer business logic expression but exhibit a 3.8x higher failure recovery complexity score. We introduce Deterministic Simulation Testing (DST) as the most effective technique for finding consistency bugs (detecting 89 percent of injected faults), and provide an open-source DST framework for distributed systems testing.

Obiageli Fashola, Hanna Magnusson, Daisuke Suzuki, Yasmin El-Masri · Aug 2024 · 178 citations

Journal Article Subscription Cloud Computing

Kubernetes Operator Pattern in Production DevOps: Custom Resource Definition Design, Controller Reconciliation Logic, and Operational Lifecycle Management

The Kubernetes Operator pattern — which encodes operational domain knowledge into custom controllers that automate the full lifecycle management of complex stateful applications — has matured from an experimental concept into a production-grade DevOps primitive. Yet the design principles, failure modes, and operational consequences of Operator development remain undercharacterized in the academic literature. This paper presents a systematic analysis of Kubernetes Operator design and operation, combining a review of 47 production-grade open-source Operators with a practitioner survey (n=287) and five organizational case studies. We introduce the Operator Design Quality Framework (ODQF), which evaluates Operators across seven dimensions: reconciliation loop idempotency, status condition expressiveness, owner reference management, leader election correctness, level-triggered vs edge-triggered design, error classification strategy, and observability instrumentation. Analysis of the 47 open-source Operators reveals that 61% exhibit at least one critical ODQF deficiency, with reconciliation non-idempotency and inadequate error classification being the most prevalent. We characterize three operator failure modes — Reconciliation Thrashing, Status Condition Stagnation, and Watch Event Storm — with detection signatures and mitigation patterns for each. Case study evidence demonstrates that teams adopting ODQF-guided development produce Operators with 73% fewer production incidents in the first year post-deployment.

Tochukwu Obi, Sara Lindström, Masashi Okamoto, Cláudia Ferreira · May 2024 · 218 citations

Journal Article Open Access Edge Computing

Neuromorphic Computing at the Edge: Energy Efficiency, Spike-Based Processing, and Real-Time Inference on Intel Loihi 2 and BrainScaleS-2 for Sensor Fusion Applications

Neuromorphic computing platforms -- which implement spiking neural networks (SNNs) on hardware architectures that mimic the sparse, event-driven computation of biological neural systems -- offer orders-of-magnitude improvements in energy efficiency for inference workloads compared to GPU and standard CPU-based inference engines. This paper presents a systematic empirical evaluation of two leading neuromorphic platforms -- Intel Loihi 2 and BrainScaleS-2 -- for edge inference applications, with particular focus on sensor fusion tasks in industrial IoT and autonomous robotics contexts. We implement and benchmark five representative sensor fusion workloads -- vibration anomaly detection, multi-modal localization, gesture recognition, event camera object tracking, and acoustic event classification -- on both platforms, measuring inference energy per sample, latency, accuracy relative to floating-point ANN baselines, and programming model usability. Loihi 2 achieves 42x energy reduction relative to Jetson Nano for vibration anomaly detection at 97.8% of ANN baseline accuracy, while BrainScaleS-2 demonstrates 18x speedup for the event camera tracking workload due to its analog emulation substrate. We introduce the Neuromorphic Inference Suitability Score (NISS) and identify the workload characteristics -- sparse temporal input, low-precision weight requirements, and real-time latency criticality -- that most strongly predict neuromorphic advantage over conventional platforms. We release SNN model implementations and training code for all five benchmarks.

Adaora Nwachukwu, Lars Holm, Ryo Yamamoto, Nadia El-Amin · May 2024 · 198 citations

Journal Article Open Access Artificial Intelligence

AI-Assisted Code Review in DevOps Pipelines: Empirical Evaluation of Large Language Model Integration for Automated Quality Gates

The integration of large language models (LLMs) into software engineering workflows has generated significant practitioner interest, yet rigorous empirical evaluation of LLM-assisted code review within DevOps pipelines remains limited. This paper presents a controlled empirical study evaluating the effectiveness of GPT-4 and Code Llama as automated code review agents within CI/CD quality gate implementations across three enterprise organizations. Our evaluation uses a benchmark corpus of 8,400 pull requests from production codebases spanning Python, Java, and TypeScript, with ground truth labels established by senior engineers reviewing each PR in a blinded protocol. LLM-based code review agents achieved 84.2% precision and 79.7% recall for security vulnerability identification, outperforming static analysis tools (SAST) on logical vulnerability classes while underperforming on injection-type vulnerabilities. For maintainability feedback, LLM agents produced actionable suggestions in 71% of cases, with engineer acceptance rates of 63% for GPT-4 and 54% for Code Llama. We introduce the Code Review Quality Score (CRQS) to standardize evaluation across dimensions. We also analyze prompt engineering strategies, context window management, and cost-latency trade-offs relevant to CI/CD integration constraints. Our findings provide the most comprehensive empirical assessment of LLM code review integration in DevOps environments to date, offering actionable deployment guidance for practitioners.

Emeka Chukwu, Lara Hoffmann, Takeshi Morikawa, Pooja Nair · Mar 2024 · 341 citations

Journal Article Open Access Computer Vision

Vision-Language Models for Industrial Quality Control: Zero-Shot and Few-Shot Defect Detection Using CLIP, GPT-4V, and Gemini Vision in Manufacturing Inspection Pipelines

Industrial visual quality control -- the automated detection and classification of surface defects, dimensional anomalies, and assembly errors in manufactured components -- has traditionally required large labeled training datasets for each new product and defect category, creating deployment friction in high-mix manufacturing environments where products change frequently. Vision-language foundation models, including CLIP, GPT-4V, and Gemini Vision, offer the potential for zero-shot and few-shot defect detection through natural language defect description, potentially eliminating dataset collection requirements for new inspection tasks. This paper presents the first systematic evaluation of vision-language models for industrial defect detection, using a benchmark suite comprising 14,400 images across six manufactured component categories (printed circuit boards, machined metal parts, woven textiles, glass panels, silicon wafers, and food products) with ground-truth defect annotations from domain expert inspectors. CLIP-based zero-shot classification achieves 74.3 percent mean detection accuracy across categories with carefully engineered text prompts, compared to 94.1 percent for fine-tuned ResNet50 on the same categories. GPT-4V few-shot with 5 defect exemplars achieves 88.7 percent accuracy, reducing the gap to supervised learning while requiring no training pipeline. We characterize the prompt engineering patterns that most strongly influence zero-shot detection performance and introduce the Industrial Vision-Language Benchmark (IVLB) as an open evaluation resource. We also analyze the latency and cost profiles of API-based vision-language model deployment in production inspection pipelines.

Emeka Okafor, Sofia Svensson, Keiko Yamamoto, Rania El-Amin · Feb 2024 · 234 citations