Research Archive

Journal Article · Open Access · Data Engineering

Lakehouse Architecture: Unifying Data Lake Flexibility and Data Warehouse Reliability Through Delta Lake, Apache Iceberg, and Apache Hudi Transaction Layers

The traditional separation of enterprise data platforms into analytical data warehouses and raw data lakes -- each optimized for different workload types and managed by distinct teams -- has created organizational and technical friction that impedes time-to-insight for analytical consumers. The Lakehouse architecture, which adds ACID transaction semantics, schema enforcement, and time travel capabilities to data lake storage through open table format layers, promises to unify these paradigms. This paper presents the first systematic academic evaluation of Lakehouse architectures, comparing three leading open table format implementations -- Delta Lake, Apache Iceberg, and Apache Hudi -- across six operational dimensions: ACID transaction correctness, concurrent write throughput, schema evolution flexibility, time travel query performance, storage efficiency, and ecosystem compatibility. Evaluations are conducted on a 50-node Spark cluster processing a 20 TB synthetic dataset whose distribution characteristics are derived from a financial institution's data platform. Delta Lake achieves the highest concurrent write throughput (340 transactions/second) and the strongest ecosystem compatibility. Iceberg demonstrates superior schema evolution flexibility and cross-engine portability. Hudi delivers the lowest storage overhead for change-heavy workloads through its record-level upsert optimization. We introduce the Lakehouse Platform Fitness Score (LPFS) and provide a selection framework based on workload mix, team expertise, and ecosystem lock-in tolerance.

Adaeze Okonjo, Erik Carlsson, Masahiro Fujita, Ana Lopes · Apr 2021 · 436 citations

Journal Article · Open Access · Software Engineering

Zero-Downtime Deployment Architectures: Blue-Green, Rolling, and Canary Strategies Under Stateful Service Constraints

Zero-downtime deployment is a foundational requirement for high-availability systems, yet achieving it under real-world conditions — involving stateful services, database schema changes, distributed transactions, and session state management — is considerably more complex than the simplified presentations common in practitioner tooling documentation. This paper presents an empirical evaluation of three primary zero-downtime deployment patterns — Blue-Green, Rolling Update, and Canary — across stateful and stateless service categories, using a controlled experimental environment replicating production conditions at a mid-scale e-commerce platform. We measure six deployment outcome dimensions: user-perceived error rate during deployment, rollback latency, resource overhead, data consistency incident rate, deployment duration, and blast radius containment. Blue-Green deployments achieve the fastest rollback (mean 47 seconds) but incur the highest resource overhead (2× baseline). Rolling updates minimize resource overhead but exhibit the highest data consistency incident rate under concurrent schema migration scenarios. Canary deployments offer the best blast radius containment with moderate rollback speed, but require sophisticated traffic routing and observability instrumentation. We introduce a Deployment Pattern Selection Matrix that maps service statefulness, data migration complexity, rollback tolerance, and resource budget to optimal pattern selection. Real-world case evidence from three production deployments is used to validate the matrix.

Seun Adeyemo, Frida Carlsson, Kenji Ishida, Mariana Ferreira · Apr 2021 · 369 citations
