Research Archive

Journal Article Subscription Data Engineering

Real-Time Stream Processing Architectures for High-Frequency Financial Data: Latency-Throughput Trade-offs in Apache Flink, Apache Spark Streaming, and Apache Storm

High-frequency financial data processing -- encompassing market tick data, order book events, and payment transaction streams -- imposes latency and throughput requirements at the boundary of what commodity stream processing frameworks can sustain, making architectural choices consequential for both business outcomes and infrastructure cost. This paper presents a rigorous comparative evaluation of three leading distributed stream processing frameworks -- Apache Flink, Apache Spark Structured Streaming, and Apache Storm -- under financial workload conditions. We design a benchmark suite comprising three representative financial workloads: sub-millisecond tick data aggregation, real-time fraud detection over payment event streams, and order book reconstruction with market microstructure analytics. Benchmarks are executed on standardized 24-node clusters across AWS, simulating peak trading session loads of up to 8 million events per second. Apache Flink achieves the lowest median end-to-end latency at 3.2ms for tick aggregation, compared to 12.1ms for Spark Structured Streaming and 8.7ms for Storm. Spark achieves the highest sustained throughput at 11.2M events/second before degradation. We introduce the Stream Processing Fitness Score (SPFS) that aggregates latency percentiles, throughput ceiling, fault recovery time, and operational complexity. We also characterize watermarking strategies, state backend selection, and checkpointing frequency as the three most impactful configuration decisions affecting latency under production conditions.

Chidi Okonkwo, Ingrid Holm, Hiroshi Matsuda, Leila Benali· Nov 2018· 356 citations
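
The abstract above names watermarking strategy as one of the three configuration decisions with the most impact on latency. As a minimal sketch of the idea, independent of any framework (this is not the Flink, Spark, or Storm API, and not the paper's benchmark code), the Python below shows how a bounded-out-of-orderness watermark decides when an event-time tumbling window can be emitted. The function name, the window size, and the out-of-order bound are hypothetical, chosen only for illustration.

```python
from collections import defaultdict


def tumbling_window_sums(events, window_ms, max_out_of_order_ms):
    """Sum values per event-time tumbling window.

    `events` is an iterable of (event_time_ms, value) pairs in arrival order.
    A window is emitted once the watermark (largest event time seen minus
    `max_out_of_order_ms`) reaches the window's end. Events whose window has
    already been closed by the watermark are counted as dropped.
    """
    open_windows = defaultdict(float)  # window start -> running sum
    emitted = []                       # (window_start, sum) in emission order
    dropped = 0
    max_seen = None
    watermark = float("-inf")

    for ts, value in events:
        start = ts - (ts % window_ms)
        if start + window_ms <= watermark:
            dropped += 1               # late: its window was already emitted
            continue
        open_windows[start] += value
        max_seen = ts if max_seen is None else max(max_seen, ts)
        watermark = max_seen - max_out_of_order_ms
        # Emit every open window whose end is at or behind the watermark.
        for s in sorted(w for w in open_windows if w + window_ms <= watermark):
            emitted.append((s, open_windows.pop(s)))

    # End of stream: flush whatever is still open.
    for s in sorted(open_windows):
        emitted.append((s, open_windows.pop(s)))
    return emitted, dropped


if __name__ == "__main__":
    sample = [(1, 1.0), (4, 2.0), (12, 3.0), (3, 4.0), (25, 5.0), (8, 6.0)]
    print(tumbling_window_sums(sample, window_ms=10, max_out_of_order_ms=5))
```

The trade-off the sketch makes visible: a larger out-of-order bound drops fewer late events but holds each window open longer, which adds directly to end-to-end latency.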

Journal Article Subscription Software Engineering

Observability-Driven Development: Rethinking Monitoring Strategies in Distributed Microservices Architectures Under DevOps

As software systems migrate from monolithic architectures to distributed microservices, traditional monitoring approaches centered on threshold-based alerting have become inadequate for maintaining system reliability. This paper introduces and formalizes the concept of Observability-Driven Development (ODD), a methodology that embeds observability instrumentation -- comprising structured logging, distributed tracing, and multi-dimensional metrics -- as a first-class engineering concern throughout the software development lifecycle. We present a longitudinal study of four organizations that adopted ODD practices over 18 months, measuring impacts on mean time to detect (MTTD), mean time to resolve (MTTR), and on-call engineer cognitive load. ODD adoption reduced MTTD by an average of 74% and MTTR by 58% compared to pre-adoption baselines. We further introduce the Observability Maturity Continuum (OMC), a five-level model characterizing organizations' progression from ad-hoc logging to predictive anomaly detection. Practical implementation guidance using OpenTelemetry, Prometheus, and Jaeger is provided. This work reframes observability not as an operational afterthought but as an architectural discipline with measurable business consequences.

Sofia Reyes-Alvarado, Tobias Winkler, Olumide Adeyemi, Hannah Park· Nov 2018· 398 citations

Journal Article Open Access Bioinformatics

Deep Learning Architectures for Genomic Variant Pathogenicity Prediction: Evaluation of CNN, LSTM, and Attention-Based Models on ClinVar and gnomAD Population Databases

The clinical interpretation of genomic variants of uncertain significance (VUS) is one of the most pressing bottlenecks in genomic medicine, with over 60 percent of variants identified in clinical sequencing classified as uncertain significance in the ClinVar database. Machine learning approaches to variant pathogenicity prediction offer the potential to reduce this uncertainty, but the relative merits of different deep learning architectures for this task -- and the generalizability of published models across population diversity -- remain incompletely understood. This paper presents a systematic evaluation of four deep learning architectures for variant pathogenicity prediction: one-dimensional CNN with nucleotide sequence context, bidirectional LSTM with epigenomic feature integration, transformer with self-attention over genomic windows, and a novel hybrid CNN-Transformer architecture we term VariantNet. Evaluation uses a benchmark dataset of 48,000 pathogenic and benign variants curated from ClinVar with gnomAD population frequency annotations, stratified by variant type (SNV, indel, splice-site) and ancestry group. VariantNet achieves the highest AUC (0.943) on the combined benchmark, with particularly strong performance on splice-site variants (AUC 0.961) where sequential context is most informative. A critical finding is significant performance degradation for all models on African ancestry variants (mean AUC drop of 0.041) due to underrepresentation in training data, which we address through ancestry-stratified training with transfer learning. We release VariantNet weights, training code, and the curated benchmark dataset as open-source resources for the bioinformatics community.

Adaeze Obi, Frida Magnusson, Hiromi Yamamoto, Yasmin Hassan· Aug 2018· 467 citations
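
The abstract above reports AUC stratified by variant type and ancestry group. As a rough illustration of what a per-stratum AUC involves (not the authors' evaluation code, and using invented toy scores rather than ClinVar or gnomAD data), the Python sketch below computes ROC AUC in its pairwise form: the probability that a randomly chosen pathogenic variant scores higher than a randomly chosen benign one. The function names and group labels are hypothetical.

```python
from collections import defaultdict


def roc_auc(labels, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    This pairwise form is quadratic in the input size, so it suits small
    illustrative inputs rather than a full benchmark.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def auc_by_group(records):
    """Compute AUC separately per group.

    `records` is an iterable of (group, label, score) with label in {0, 1}.
    """
    grouped = defaultdict(lambda: ([], []))
    for group, label, score in records:
        grouped[group][0].append(label)
        grouped[group][1].append(score)
    return {g: roc_auc(ys, ss) for g, (ys, ss) in grouped.items()}


if __name__ == "__main__":
    # Toy data only: (hypothetical group, label, model score).
    toy = [
        ("group_a", 1, 0.9), ("group_a", 1, 0.8),
        ("group_a", 0, 0.3), ("group_a", 0, 0.1),
        ("group_b", 1, 0.7), ("group_b", 1, 0.4),
        ("group_b", 0, 0.5), ("group_b", 0, 0.2),
    ]
    print(auc_by_group(toy))
```

Reporting the metric per group, rather than only on the pooled benchmark, is what exposes a gap such as the ancestry-related drop the abstract describes.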

Journal Article Subscription Software Architecture

Microservices Decomposition Strategies and Their Operational Consequences in DevOps Environments: Domain-Driven Design, Bounded Contexts, and Service Granularity

The decision to decompose a system into microservices is one of the most consequential architectural choices a DevOps organization makes, yet the criteria governing appropriate service granularity remain poorly defined in both academic and practitioner literature. This paper examines microservices decomposition strategies and their downstream operational consequences for DevOps pipeline complexity, observability overhead, inter-service coordination cost, and deployment independence. We conduct a retrospective analysis of decomposition decisions at six organizations over three-year time horizons, supplemented by a survey of 267 software architects and DevOps engineers. Domain-Driven Design (DDD) bounded contexts are used as the theoretical lens, and we evaluate how closely organizations' decomposition decisions align with DDD principles and how alignment correlates with operational outcomes. Organizations with DDD-aligned decompositions report 47% lower inter-service incident rates and 38% fewer deployment pipeline interdependencies compared to organizations using ad-hoc decomposition heuristics. We identify five granularity anti-patterns -- Nano-service Proliferation, Shared Database Coupling, Chatty Service Mesh, God Service Regression, and Temporal Coupling Latency -- and provide detection heuristics and refactoring guidance for each. The paper provides a practical decomposition decision framework integrating DDD, operational complexity, and DevOps pipeline cost dimensions.

Emeka Nwosu, Annika Bergman, Takashi Suzuki, Ana Cristina Pires· Aug 2018· 362 citations

Journal Article Open Access Cybersecurity

Integrating Security into DevOps: Empirical Assessment of DevSecOps Adoption Barriers and Enablers in Financial Services Organizations

The integration of security practices into DevOps pipelines — commonly termed DevSecOps or "shifting security left" — has attracted significant practitioner interest, yet academic understanding of the organizational dynamics that enable or impede this integration remains nascent. This paper reports findings from a grounded theory study conducted across nine financial services organizations undergoing DevSecOps transformation. Data was collected through 64 interviews with security engineers, DevOps leads, compliance officers, and CISOs, supplemented by documentary analysis of security policy artifacts and incident logs spanning 24 months. Our analysis yielded a substantive theory of DevSecOps adoption organized around three core categories: Security-Development Trust Formation, Toolchain Convergence, and Regulatory Constraint Navigation. We find that the predominant barrier to DevSecOps adoption is not technical but relational: the adversarial framing historically embedded between security and development teams. Organizations that successfully dissolve this framing through shared ownership models and joint blameless post-mortems exhibit twice the rate of automated security gate adoption. The paper contributes an empirically grounded theoretical model and a set of practitioner interventions for accelerating DevSecOps adoption in regulated industries.

Nadia Okonkwo, Lars Bergström, Mei-Ling Chen, Arjun Patel· May 2018· 521 citations

Journal Article Open Access Blockchain

Smart Contract Vulnerability Analysis: Automated Detection of Reentrancy, Integer Overflow, and Access Control Flaws in Ethereum Solidity Codebases

Smart contracts deployed on public blockchain platforms such as Ethereum execute autonomously and immutably, meaning that security vulnerabilities discovered post-deployment cannot be patched without costly migration procedures -- a constraint that elevates pre-deployment security analysis to critical importance. This paper presents SmartGuard, a hybrid static-symbolic analysis framework for automated detection of smart contract vulnerabilities, evaluated against a dataset of 48,000 verified Solidity contracts drawn from the Ethereum mainnet. SmartGuard combines abstract syntax tree analysis, control flow graph construction, and bounded symbolic execution to detect six vulnerability classes: reentrancy, integer overflow and underflow, timestamp dependence, unprotected self-destruct, access control misconfigurations, and front-running susceptibility. On a labeled benchmark of 2,400 contracts with ground-truth vulnerability annotations, SmartGuard achieves 91.2% precision and 87.6% recall for reentrancy detection and 88.4% precision and 83.1% recall averaged across all six vulnerability classes, outperforming Mythril, Slither, and Oyente on four of six categories. We analyze the 48,000 mainnet contracts and find that 23.4% contain at least one high-severity vulnerability, with integer overflow (14.1%) and access control misconfiguration (9.3%) being the most prevalent. We release SmartGuard as an open-source tool and discuss implications for smart contract audit workflows and DeFi protocol governance.

Obiora Okeke, Sofia Lindqvist, Kenji Nakamura, Yasmin Hassan· May 2018· 498 citations

Journal Article Open Access Software Engineering

Site Reliability Engineering Practices in DevOps Organizations: Service Level Objectives, Error Budgets, and the Reliability-Velocity Trade-off

Site Reliability Engineering (SRE), as formalized by Google, proposes a principled framework for managing the tension between system reliability and deployment velocity through the use of Service Level Objectives (SLOs) and error budgets. Despite widespread adoption of SRE terminology, rigorous empirical investigation of how organizations operationalize SRE principles — and with what outcomes — remains limited. This paper presents findings from a cross-sectional study of 22 organizations that have formally adopted SRE practices, using surveys (n=341), pipeline instrumentation data analysis, and structured interviews with SRE team leads. We find significant heterogeneity in SRE implementation: only 38% of organizations claiming SRE adoption have defined SLOs with error budget enforcement mechanisms; the remainder use SLO-like metrics purely for dashboarding without consequential decision-making authority. Organizations with enforced error budgets exhibit statistically significant reductions in both critical incident frequency (–44%) and deployment-related rollbacks (–39%) compared to SRE-nominal organizations. We introduce the SRE Implementation Fidelity Score (SIFS) to characterize the gap between claimed and operational SRE maturity, and demonstrate its predictive validity against reliability outcomes. We also examine the organizational design question of embedded versus centralized SRE teams, finding that embedded models achieve faster incident response but higher knowledge fragmentation.

Chiamaka Eze, Lars Eriksson, Yosuke Fujita, Beatriz Almeida· Feb 2018· 487 citations
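
The abstract above distinguishes organizations that enforce error budgets from those that only chart SLO-like metrics. As a minimal sketch of the arithmetic behind an enforced budget (an illustration under assumed numbers, not the paper's SIFS instrument or any organization's policy), the Python below derives the budget from a request-based SLO and gates deployments on what remains. The class name, field names, and the "no deploys once the budget is spent" rule are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorBudget:
    """Request-based error budget for one SLO window (hypothetical model)."""

    slo_target: float      # required success ratio, e.g. 0.999
    total_requests: int    # requests served in the window
    failed_requests: int   # requests that violated the SLO

    def __post_init__(self):
        if not 0.0 < self.slo_target < 1.0:
            raise ValueError("slo_target must be strictly between 0 and 1")
        if self.total_requests <= 0:
            raise ValueError("total_requests must be positive")

    @property
    def allowed_failures(self) -> float:
        # The budget is the share of requests the SLO permits to fail.
        return (1.0 - self.slo_target) * self.total_requests

    @property
    def remaining_fraction(self) -> float:
        # 1.0 means untouched, 0.0 means spent, negative means overspent.
        return 1.0 - self.failed_requests / self.allowed_failures

    def deploys_allowed(self) -> bool:
        # Assumed enforcement rule: block releases once the budget is spent.
        return self.remaining_fraction > 0.0


if __name__ == "__main__":
    budget = ErrorBudget(slo_target=0.999, total_requests=1_000_000,
                         failed_requests=250)
    print(round(budget.allowed_failures), round(budget.remaining_fraction, 2),
          budget.deploys_allowed())
```

With a 99.9% target over one million requests the budget is about 1,000 failed requests, so 250 failures leave roughly three quarters of it. The gate in `deploys_allowed` is what separates enforcement from dashboarding.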

Journal Article Subscription Distributed Systems

Consensus Algorithm Performance in Byzantine Fault-Tolerant Distributed Systems: Comparative Analysis of PBFT, HotStuff, and Tendermint Under Adversarial Network Conditions

Byzantine Fault Tolerant (BFT) consensus algorithms are foundational to the correctness of distributed ledger systems, permissioned blockchain networks, and replicated state machines in adversarial environments. The theoretical properties of leading BFT protocols are well-established, yet their comparative performance under realistic network adversary models -- including network partitions, message delays, and active Byzantine behavior -- remains undercharacterized in empirical literature. This paper presents a controlled experimental evaluation of three BFT consensus protocols -- Practical BFT (PBFT), HotStuff, and Tendermint -- across five adversary scenario categories: crash failures only, Byzantine equivocation, network partition (minority and majority), variable message delay (50ms-2000ms), and compound adversarial conditions. Experiments are conducted on a 100-node WAN testbed spanning AWS regions in three continents. HotStuff achieves the highest throughput (12,400 TPS) under benign conditions and the most graceful throughput degradation under Byzantine equivocation attacks (47% throughput retention at f=10 faulty nodes). PBFT exhibits the lowest latency at low node counts (4-node median finality 98ms) but degrades superlinearly with cluster size. Tendermint demonstrates the best liveness under network partition conditions due to its timeout-based leader rotation. We introduce the BFT Protocol Resilience Score (BPRS) and provide a protocol selection matrix mapping deployment scenario characteristics to optimal protocol choice.

Obinna Eze, Marcus Bergstrom, Kenji Yoshida, Leila El-Amin· Feb 2018· 412 citations
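
The abstract above quotes results at specific cluster sizes and fault counts (a 4-node cluster, a 100-node testbed, f=10 faulty nodes). For orientation, the Python sketch below encodes the classical BFT sizing bounds: n replicas tolerate at most f Byzantine faults when n >= 3f + 1, and a quorum must be large enough that any two quorums share at least one honest replica. This is textbook arithmetic, not code from the paper, and the function names are hypothetical.

```python
def max_byzantine_faults(n: int) -> int:
    """Largest f such that n >= 3f + 1."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return (n - 1) // 3


def quorum_size(n: int) -> int:
    """Smallest quorum whose pairwise intersections contain an honest replica.

    Two quorums of size q overlap in at least 2q - n replicas; requiring that
    overlap to exceed f gives q = ceil((n + f + 1) / 2). When n = 3f + 1 this
    reduces to the familiar 2f + 1.
    """
    f = max_byzantine_faults(n)
    return (n + f + 2) // 2  # integer form of ceil((n + f + 1) / 2)


def tolerates(n: int, faulty: int) -> bool:
    """True if `faulty` Byzantine replicas are within the bound for n replicas."""
    return 0 <= faulty <= max_byzantine_faults(n)


if __name__ == "__main__":
    for n in (4, 100):
        print(n, max_byzantine_faults(n), quorum_size(n))
    print(tolerates(100, 10))
```

By these bounds a 100-node deployment can tolerate up to 33 Byzantine replicas, so the f=10 scenario in the abstract sits well inside the theoretical limit.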