Research Archive

Journal Article Subscription Software Architecture

Evolutionary Architecture Governance: Managing Technical Debt, Architectural Fitness Functions, and Incremental Modernization in Long-Lived Enterprise Systems

Enterprise software systems routinely operate for decades beyond their initial design lifetime, accumulating technical debt that progressively impedes feature delivery, increases operational risk, and raises maintenance costs. Architectural modernization -- the structured evolution of such systems toward contemporary architectural patterns -- is one of the highest-stakes and least-understood challenges in software engineering practice. This paper presents a longitudinal study of architectural modernization programs at six large enterprises, tracking architecture evolution decisions, fitness function definition and measurement, and technical debt quantification over a 3-year observation period. Drawing on 88 interviews and quarterly architecture review artifact analysis, we develop a grounded theory of Evolutionary Architecture Governance (EAG), comprising three core practices: Continuous Fitness Function Monitoring (automated measurement of architectural properties such as coupling metrics, cyclomatic complexity, and deployment independence), Technical Debt Heat Mapping (priority-weighted visualization of debt concentration across system components), and Strangler Fig-Guided Incremental Modernization (structured extraction of bounded functionality from monolithic cores into independently deployable units). Organizations implementing all three EAG practices demonstrate 61% higher architecture conformance rates and 44% lower severity-1 incident rates attributable to architectural violations compared to organizations relying on periodic architecture review cycles. We provide an EAG implementation toolkit and a validated architectural fitness function library spanning 24 properties.

Seun Bello, Anna Magnusson, Shuji Watanabe, Catarina Rodrigues · Nov 2022 · 287 citations

Journal Article Open Access Cybersecurity

Supply Chain Security in DevOps: Taxonomy, Risk Assessment, and Automated Mitigation Strategies for Software Dependency Vulnerabilities

High-profile software supply chain attacks — most notably the SolarWinds SUNBURST incident and the Log4Shell vulnerability — have exposed critical security gaps in DevOps pipelines that rely on third-party and open-source dependencies. This paper provides a comprehensive treatment of software supply chain security within the context of DevOps, presenting a threat taxonomy, a quantitative risk assessment methodology, and a suite of automated mitigation strategies. We analyze 1,240 software supply chain incidents reported between 2018 and 2022, categorizing attack vectors across six dimensions: dependency confusion, typosquatting, compromised maintainer accounts, malicious commit injection, build pipeline compromise, and artifact tampering. We introduce the Supply Chain Risk Score (SCRS), which aggregates dependency provenance, maintainer reputation, patch velocity, and transitive exposure into a single risk signal consumable by CI/CD gates. We evaluate the SCRS against a holdout dataset of 180 known malicious packages, achieving 87.4% detection precision at 92.1% recall. We further describe an SBOM-integrated DevSecOps reference architecture implementing SLSA Level 3 attestation and demonstrate its deployment in a Fortune 500 organization. This work provides both theoretical grounding and concrete engineering guidance for addressing supply chain threats in high-velocity delivery environments.

Ngozi Eze-Williams, Maximilian Bauer, Sora Kim, Abdul-Rahman Hassan · Oct 2022 · 537 citations
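
The abstract above describes the Supply Chain Risk Score as an aggregate of four signals that a CI/CD gate can consume, but it does not give the formula. As a minimal sketch only, assuming a plain weighted average of signals normalized to [0, 1], the idea might look like the following. The weights, the threshold, the package names, and the names `DependencySignals`, `risk_score`, and `gate` are all hypothetical and are not taken from the paper.

```python
from dataclasses import asdict, dataclass

# Hypothetical weights: the abstract does not publish the actual SCRS formula.
WEIGHTS = {
    "provenance": 0.3,
    "maintainer_reputation": 0.2,
    "patch_velocity": 0.2,
    "transitive_exposure": 0.3,
}


@dataclass(frozen=True)
class DependencySignals:
    """Per-dependency risk signals, each normalized to [0, 1] (1 = riskiest)."""
    provenance: float
    maintainer_reputation: float
    patch_velocity: float
    transitive_exposure: float


def risk_score(signals: DependencySignals) -> float:
    """Weighted average of the four signals; the result stays in [0, 1]."""
    values = asdict(signals)
    for name, value in values.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    return sum(WEIGHTS[name] * values[name] for name in WEIGHTS)


def gate(dependencies: dict[str, DependencySignals], threshold: float = 0.6) -> list[str]:
    """Return the sorted names of dependencies whose score exceeds the threshold."""
    return sorted(
        name for name, signals in dependencies.items()
        if risk_score(signals) > threshold
    )


if __name__ == "__main__":
    deps = {
        "example-risky-pkg": DependencySignals(0.9, 0.8, 0.7, 0.9),
        "example-trusted-pkg": DependencySignals(0.1, 0.1, 0.2, 0.3),
    }
    blocked = gate(deps)
    print("blocked:", blocked)
```

A pipeline step could fail the build whenever `gate` returns a non-empty list. A linear combination is used here only because it is easy to audit; the paper's actual aggregation may differ.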

Journal Article Open Access Natural Language Processing

Retrieval-Augmented Generation for Knowledge-Intensive Enterprise Tasks: Dense Retriever Design, Index Freshness Management, and Faithfulness Evaluation in Production RAG Systems

Retrieval-Augmented Generation (RAG) -- the combination of dense retrieval over a knowledge corpus with generative language model synthesis -- has emerged as the leading architectural pattern for grounding large language model outputs in verifiable factual sources, addressing the hallucination and knowledge staleness limitations of parametric LLM knowledge. Despite rapid practitioner adoption, the engineering design space of production RAG systems -- covering retriever architecture selection, embedding model choice, index freshness management, chunking strategy, context window utilization, and faithfulness evaluation -- lacks systematic empirical treatment. This paper presents the first comprehensive engineering study of production RAG systems, evaluating design decisions across 14 dimensions using a standardized enterprise question answering benchmark comprising 8,400 questions across legal, financial, and technical documentation corpora. We find that bi-encoder dense retrievers (DPR, E5-large) outperform BM25 sparse retrieval by 18.4 F1 points on complex multi-hop questions but underperform by 7.2 points on exact keyword lookup queries, motivating hybrid retrieval as the default architecture. Chunk size has the highest sensitivity of any single design parameter -- optimal chunk size varies by 4x across corpora depending on document structure. We introduce the RAG Faithfulness Score (RFS), a composite metric measuring citation accuracy, claim groundedness, and context utilization efficiency, and demonstrate its correlation with downstream user trust ratings (Pearson r=0.74). We release evaluation code, benchmark datasets, and optimal configuration templates for six enterprise RAG deployment profiles.

Obiora Chukwu, Maja Svensson, Yuki Matsumoto, Laila Benali · Jul 2022 · 534 citations
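
The abstract above motivates hybrid retrieval, because dense retrievers win on multi-hop questions while BM25 wins on exact keyword lookups, but it does not say how the two result lists are combined. One common option is reciprocal rank fusion (RRF), sketched below as an illustration only. It is a swapped-in, generic technique and not necessarily what the authors used; the function name, the document IDs, and the constant `k=60` (a conventional default) are assumptions.

```python
from collections import defaultdict
from typing import Iterable, Sequence


def reciprocal_rank_fusion(rankings: Iterable[Sequence[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document IDs into one ranking.

    Each document receives 1 / (k + rank) from every list it appears in
    (rank starts at 1); documents are returned by descending total score,
    with ties broken by document ID for determinism.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


if __name__ == "__main__":
    sparse_hits = ["d1", "d2", "d3"]  # e.g. a BM25 result list (hypothetical IDs)
    dense_hits = ["d3", "d1", "d4"]   # e.g. a bi-encoder result list
    print(reciprocal_rank_fusion([sparse_hits, dense_hits]))
```

RRF needs only ranks, not comparable scores, which is why it is a frequent first choice when mixing sparse and dense retrievers whose raw scores live on different scales.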

Journal Article Open Access Cybersecurity

Policy as Code in DevOps: Automated Governance, Open Policy Agent Integration, and Compliance-as-Code Maturity in Cloud-Native Pipelines

As organizations scale their cloud-native DevOps operations, the manual enforcement of security, compliance, and operational policies becomes a significant bottleneck and audit risk. Policy as Code (PaC) — the expression of organizational policies in machine-readable, version-controlled formats that can be automatically evaluated within CI/CD pipelines — has emerged as a scalable alternative. This paper presents the first systematic academic treatment of Policy as Code in DevOps contexts, combining a systematic literature review, an evaluation of three leading PaC frameworks (Open Policy Agent/Rego, Kyverno, and AWS Config Rules), and an empirical study of four organizations that implemented PaC programs over 12–24 months. We define a Policy as Code Taxonomy covering eight policy domains — identity and access, network security, data classification, resource configuration, software supply chain, cost governance, operational thresholds, and regulatory mapping — and evaluate framework suitability across domains. Organizations with mature PaC implementations achieve 94% automated policy coverage (vs 41% baseline), reduce policy violation escape rate to production by 87%, and report audit preparation time reductions of 65%. We introduce the Policy Coverage Efficiency Score (PCES) as a standardized measure of PaC program maturity and provide a PaC implementation roadmap with phase-specific toolchain recommendations.

Isioma Nwofor, Björn Andersson, Shunsuke Nakajima, Marta Alves · Jul 2022 · 344 citations

Journal Article Open Access Artificial Intelligence

Transformer Architecture Optimization for On-Device Inference: Knowledge Distillation, Quantization, and Pruning Strategies for Deploying Large Language Models on Edge Hardware

The deployment of transformer-based large language models on edge devices -- smartphones, embedded systems, and IoT endpoints -- requires model compression techniques that preserve task performance while meeting the memory, compute, and energy constraints of target hardware. This paper presents a systematic empirical study of three compression paradigms -- knowledge distillation, quantization (both post-training quantization and quantization-aware training), and structured and unstructured pruning -- applied to BERT-base, DistilBERT, and GPT-2-medium target models, evaluated on NLP benchmarks (GLUE, SQuAD) and deployed on Qualcomm Snapdragon 888, Apple A15 Bionic, and STM32H7 microcontroller platforms. We characterize the accuracy-compression trade-off surface for each technique individually and in combination, finding that hybrid pipelines combining 4-bit quantization with structured pruning at 40% sparsity achieve 6.2x model size reduction and 4.1x inference speedup on Snapdragon 888 at less than 3% accuracy degradation on GLUE. On the STM32H7 microcontroller, task-specific distilled models achieve viable inference at 380 ms per token under severe 256 KB RAM constraints. We introduce the Edge Deployment Efficiency Index (EDEI), which normalizes accuracy retention against inference latency, memory footprint, and energy consumption, and release a reproducible compression pipeline toolkit supporting all three techniques. This work provides the most comprehensive empirical guide to LLM edge deployment to date.

Chukwuemeka Aneke, Sofia Bergqvist, Keiji Matsumoto, Salma Benali · Apr 2022 · 478 citations

Journal Article Open Access Software Engineering

Chaos Engineering in Production: Systematic Fault Injection as a DevOps Reliability Practice — Evidence from Microservices Deployments at Scale

Chaos engineering — the discipline of deliberately injecting faults into production systems to uncover latent weaknesses before they cause customer-impacting failures — has matured from an experimental practice pioneered by Netflix into a mainstream reliability engineering methodology. Yet its systematic integration into DevOps workflows and its measured effects on system reliability at scale remain understudied in the academic literature. This paper presents findings from a three-year longitudinal study of chaos engineering adoption across five organizations operating microservices platforms at scale (ranging from 80 to 1,400 services). We analyze 2,847 chaos experiments conducted across these organizations, categorized by fault type, blast radius, hypothesis quality, and outcome. Our analysis shows that well-formulated chaos experiments with defined steady-state hypotheses uncovered actionable weaknesses in 67% of cases. Organizations with mature chaos programs (>50 experiments per quarter) exhibited 78% fewer severity-1 incidents per deployment compared to organizations without chaos programs. We introduce the Chaos Experiment Quality Score (CEQS), a composite metric for assessing experiment design rigor, and demonstrate its correlation with actionable outcome rate. We also identify the three most impactful fault categories — network partition, resource exhaustion, and dependency timeout — accounting for 71% of all discovered weaknesses.

Tunde Afolabi, Ingrid Petersen, Zhou Weiming, Catalina Iorga · Apr 2022 · 481 citations

Journal Article Subscription Software Engineering

Event-Driven Architecture and DevOps: Operational Patterns for Kafka-Based Microservices Pipelines, Consumer Group Management, and Schema Registry Governance

Event-driven architectures (EDA) based on distributed streaming platforms such as Apache Kafka have become ubiquitous in high-throughput microservices deployments, yet the DevOps practices required to reliably build, deploy, and operate EDA systems remain poorly documented in the academic literature. This paper characterizes the DevOps operational patterns specific to Kafka-based EDA systems, drawing on case studies of five organizations operating Kafka clusters processing between 500 million and 12 billion events daily. We identify and systematize 19 EDA-specific DevOps patterns organized into four categories: Deployment Patterns (schema-compatible rolling upgrades, consumer lag-aware deployment gates), Observability Patterns (consumer group lag monitoring, dead letter queue alerting, schema compatibility drift detection), Governance Patterns (schema registry lifecycle management, topic naming conventions, retention policy automation), and Resilience Patterns (chaos-tested consumer rebalancing, idempotent consumer design, poison pill handling). We evaluate these patterns against three operational outcome dimensions — message delivery reliability, deployment-induced consumer lag, and schema evolution incident rate — using telemetry data from the case study organizations. Organizations implementing the full pattern set achieve 99.994% message delivery reliability and zero schema-induced consumer failures across 18 months of observation. We provide a Kafka DevOps Maturity Assessment and an open-source toolchain configuration reference.

Oluwatobi Akinola, Stefan Nordström, Yutaka Kimura, Sofia Costa · Feb 2022 · 312 citations

Journal Article Open Access Autonomous Systems

Formal Safety Verification for Autonomous Vehicle Decision-Making Systems: Temporal Logic Specification, Model Checking, and Runtime Monitoring of Highway Merge and Intersection Scenarios

The deployment of autonomous vehicles in public road environments requires safety assurances that go beyond empirical, testing-based approaches, whose coverage limitations are well documented for rare but catastrophic edge-case scenarios. Formal verification methods -- which mathematically prove that system behavior satisfies specified safety properties for all possible inputs within a defined operating domain -- offer complementary guarantees, but their application to the complex, continuous, and probabilistic decision-making systems of autonomous vehicles presents significant scalability and modeling challenges. This paper presents a formal safety verification framework for autonomous vehicle decision-making, applied to two safety-critical scenarios: highway lane change and merge execution, and unprotected intersection crossing. We formalize safety properties in Signal Temporal Logic (STL) and Linear Temporal Logic (LTL), covering collision avoidance, traffic law compliance, progress guarantees, and passenger comfort bounds. Model checking is performed using Uppaal and SpaceEx on hybrid automaton models of the decision system, while runtime monitoring uses a custom STL monitor integrated into the Robot Operating System 2 (ROS 2) execution environment. The framework identifies 7 previously undetected safety violation scenarios in a production-candidate decision system, including a highway merge deadlock that arises under specific combinations of sensor degradation and adversarial driver behavior. We discuss the fundamental limitations of formal verification for open-world AV operation and propose a hybrid formal-statistical safety assurance methodology addressing these limitations.

Adaeze Nwosu, Emma Lindgren, Takashi Fujita, Yasmin Mansour · Jan 2022 · 378 citations
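
The abstract above mentions a custom STL runtime monitor without describing its internals. As a rough illustration of what runtime monitoring of a collision-avoidance property can involve, the sketch below tracks the standard quantitative (robustness) semantics of the STL formula "always (gap >= d_min)" over a sampled signal: the robustness is the minimum of gap - d_min seen so far, and a negative value means the property has been violated. The class name `MinGapMonitor`, the sample values, and the 4.0 m threshold are hypothetical, and no ROS 2 integration is shown.

```python
from dataclasses import dataclass


@dataclass
class MinGapMonitor:
    """Online monitor for the STL property G (gap >= d_min) on sampled data.

    Robustness of "always" is the minimum over time of the predicate margin
    (gap - d_min); it starts at +infinity before any sample is observed.
    """
    d_min: float
    robustness: float = float("inf")

    def update(self, gap: float) -> float:
        """Consume one gap sample (same unit as d_min) and return the robustness so far."""
        self.robustness = min(self.robustness, gap - self.d_min)
        return self.robustness

    @property
    def violated(self) -> bool:
        """True once any sample has fallen below the minimum gap."""
        return self.robustness < 0.0


if __name__ == "__main__":
    monitor = MinGapMonitor(d_min=4.0)
    for gap in [12.0, 8.5, 5.0, 9.0, 3.5]:  # hypothetical gap samples in metres
        margin = monitor.update(gap)
        print(f"gap={gap:5.1f}  robustness={margin:5.1f}  violated={monitor.violated}")
```

Reporting a signed margin rather than a boolean lets a supervisor react before the bound is crossed, for example by triggering a fallback maneuver when the robustness drops below a small positive value.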