Research Archive

Journal Article · Open Access · Software Engineering

Site Reliability Engineering Practices in DevOps Organizations: Service Level Objectives, Error Budgets, and the Reliability-Velocity Trade-off

Site Reliability Engineering (SRE), as formalized by Google, proposes a principled framework for managing the tension between system reliability and deployment velocity through the use of Service Level Objectives (SLOs) and error budgets. Despite widespread adoption of SRE terminology, rigorous empirical investigation of how organizations operationalize SRE principles — and with what outcomes — remains limited. This paper presents findings from a cross-sectional study of 22 organizations that have formally adopted SRE practices, using surveys (n=341), pipeline instrumentation data analysis, and structured interviews with SRE team leads. We find significant heterogeneity in SRE implementation: only 38% of organizations claiming SRE adoption have defined SLOs with error budget enforcement mechanisms; the remainder use SLO-like metrics purely for dashboarding without consequential decision-making authority. Organizations with enforced error budgets exhibit statistically significant reductions in both critical incident frequency (–44%) and deployment-related rollbacks (–39%) compared to SRE-nominal organizations. We introduce the SRE Implementation Fidelity Score (SIFS) to characterize the gap between claimed and operational SRE maturity, and demonstrate its predictive validity against reliability outcomes. We also examine the organizational design question of embedded versus centralized SRE teams, finding that embedded models achieve faster incident response but higher knowledge fragmentation.

Chiamaka Eze, Lars Eriksson, Yosuke Fujita, Beatriz Almeida · Feb 2018 · 487 citations
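The distinction the abstract draws between enforced error budgets and dashboard-only SLO metrics can be illustrated with a minimal sketch. It is not taken from the paper: the `ErrorBudget` class, its field names, and the "freeze releases once the budget is spent" rule are assumptions chosen for illustration, and real enforcement policies may differ.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    """Hypothetical request-based error budget for one SLO window."""

    slo_target: float     # e.g. 0.999 means 99.9% of requests must succeed
    total_requests: int   # requests observed in the SLO window
    failed_requests: int  # requests that violated the SLO in that window

    @property
    def allowed_failures(self) -> float:
        # The budget is the share of requests the SLO permits to fail.
        return (1.0 - self.slo_target) * self.total_requests

    @property
    def remaining_fraction(self) -> float:
        # 1.0 means an untouched budget; 0.0 or below means it is exhausted.
        if self.allowed_failures == 0:
            return 0.0
        return 1.0 - self.failed_requests / self.allowed_failures

    def deploys_allowed(self) -> bool:
        # Assumed enforcement rule: block releases once the budget is spent.
        return self.remaining_fraction > 0.0


healthy = ErrorBudget(slo_target=0.999, total_requests=1_000_000, failed_requests=400)
exhausted = ErrorBudget(slo_target=0.999, total_requests=1_000_000, failed_requests=1_200)
print(healthy.deploys_allowed(), exhausted.deploys_allowed())
```

An organization that only dashboards `remaining_fraction` would stop at the property; the enforced variant is the one where `deploys_allowed()` actually gates the release pipeline.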

Journal Article · Subscription · Distributed Systems

Consensus Algorithm Performance in Byzantine Fault-Tolerant Distributed Systems: Comparative Analysis of PBFT, HotStuff, and Tendermint Under Adversarial Network Conditions

Byzantine Fault Tolerant (BFT) consensus algorithms are foundational to the correctness of distributed ledger systems, permissioned blockchain networks, and replicated state machines in adversarial environments. The theoretical properties of leading BFT protocols are well-established, yet their comparative performance under realistic network adversary models (including network partitions, message delays, and active Byzantine behavior) remains undercharacterized in the empirical literature. This paper presents a controlled experimental evaluation of three BFT consensus protocols, Practical BFT (PBFT), HotStuff, and Tendermint, across five adversary scenario categories: crash failures only, Byzantine equivocation, network partition (minority and majority), variable message delay (50 ms to 2000 ms), and compound adversarial conditions. Experiments are conducted on a 100-node WAN testbed spanning AWS regions on three continents. HotStuff achieves the highest throughput (12,400 TPS) under benign conditions and the most graceful throughput degradation under Byzantine equivocation attacks (47% throughput retention at f=10 faulty nodes). PBFT exhibits the lowest latency at low node counts (4-node median finality 98 ms) but degrades superlinearly with cluster size. Tendermint demonstrates the best liveness under network partition conditions due to its timeout-based leader rotation. We introduce the BFT Protocol Resilience Score (BPRS) and provide a protocol selection matrix mapping deployment scenario characteristics to optimal protocol choice.

Obinna Eze, Marcus Bergstrom, Kenji Yoshida, Leila El-Amin · Feb 2018 · 412 citations
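For context on the node counts quoted in the abstract, the standard BFT resilience bound (n ≥ 3f + 1) can be worked through in a short sketch. The function names `max_faulty` and `quorum_size` are illustrative assumptions, not identifiers from the paper or from any of the three protocol implementations.

```python
def max_faulty(n: int) -> int:
    """Largest number of Byzantine nodes f tolerated by n replicas (n >= 3f + 1)."""
    if n < 1:
        raise ValueError("cluster must contain at least one node")
    return (n - 1) // 3


def quorum_size(n: int) -> int:
    """Smallest vote count such that any two quorums share an honest node.

    Two quorums of size q overlap in at least 2q - n nodes; requiring that
    overlap to exceed f gives q >= (n + f + 1) / 2, rounded up. When
    n = 3f + 1 this reduces to the familiar 2f + 1.
    """
    f = max_faulty(n)
    return (n + f + 2) // 2  # integer form of ceil((n + f + 1) / 2)


for n in (4, 100):
    print(n, max_faulty(n), quorum_size(n))
```

Under this bound a 100-node cluster tolerates up to 33 Byzantine nodes, so the f=10 scenario in the abstract sits well inside the theoretical limit, while the 4-node configuration tolerates a single fault.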
