Research Archive

Journal Article Subscription Data Engineering

Real-Time Stream Processing Architectures for High-Frequency Financial Data: Latency-Throughput Trade-offs in Apache Flink, Apache Spark Streaming, and Apache Storm

High-frequency financial data processing -- encompassing market tick data, order book events, and payment transaction streams -- imposes latency and throughput requirements at the boundary of what commodity stream processing frameworks can sustain, making architectural choices consequential for both business outcomes and infrastructure cost. This paper presents a rigorous comparative evaluation of three leading distributed stream processing frameworks -- Apache Flink, Apache Spark Structured Streaming, and Apache Storm -- under financial workload conditions. We design a benchmark suite comprising three representative financial workloads: sub-millisecond tick data aggregation, real-time fraud detection over payment event streams, and order book reconstruction with market microstructure analytics. Benchmarks are executed on standardized 24-node clusters across AWS, simulating peak trading session loads of up to 8 million events per second. Apache Flink achieves the lowest median end-to-end latency for tick aggregation at 3.2ms, compared to 12.1ms for Spark Structured Streaming and 8.7ms for Storm. Spark achieves the highest sustained throughput, reaching 11.2M events/second before degradation. We introduce the Stream Processing Fitness Score (SPFS), which aggregates latency percentiles, throughput ceiling, fault recovery time, and operational complexity. We also identify watermarking strategy, state backend selection, and checkpointing frequency as the three configuration decisions with the greatest impact on latency under production conditions.

Chidi Okonkwo, Ingrid Holm, Hiroshi Matsuda, Leila Benali · Nov 2018 · 356 citations

Journal Article Subscription Software Engineering

Observability-Driven Development: Rethinking Monitoring Strategies in Distributed Microservices Architectures Under DevOps

As software systems migrate from monolithic architectures to distributed microservices, traditional monitoring approaches centered on threshold-based alerting have become inadequate for maintaining system reliability. This paper introduces and formalizes the concept of Observability-Driven Development (ODD), a methodology that embeds observability instrumentation (structured logging, distributed tracing, and multi-dimensional metrics) as a first-class engineering concern throughout the software development lifecycle. We present a longitudinal study of four organizations that adopted ODD practices over 18 months, measuring impacts on mean time to detect (MTTD), mean time to resolve (MTTR), and on-call engineer cognitive load. ODD adoption reduced MTTD by an average of 74% and MTTR by 58% compared to pre-adoption baselines. We further introduce the Observability Maturity Continuum (OMC), a five-level model characterizing an organization's progression from ad-hoc logging to predictive anomaly detection. Practical implementation guidance using OpenTelemetry, Prometheus, and Jaeger is provided. This work reframes observability not as an operational afterthought but as an architectural discipline with measurable business consequences.

Sofia Reyes-Alvarado, Tobias Winkler, Olumide Adeyemi, Hannah Park · Nov 2018 · 398 citations

