A data lakehouse sits at the intersection of two older architectures that each solved half the problem. By the end of this guide you'll understand exactly what a lakehouse is, how the underlying technology makes it work, which table format to choose, and how to decide whether you actually need one.

The Problem That Lakehouses Solve

To understand why lakehouses exist, you need to understand what came before — and what each approach got wrong.

Data warehouses (Snowflake, BigQuery, Redshift) are purpose-built for SQL analytics. They store data in a proprietary columnar format, enforce schemas, support ACID transactions, and deliver fast query performance. The catch: they're expensive to store data in, they're walled gardens (you can't run Python ML workloads directly on the data), and extracting large datasets for training or batch processing is slow and costly.

Data lakes (raw files in S3, GCS, or Azure Blob Storage) solved the cost and openness problems. Object storage costs pennies per gigabyte. Any tool — Spark, Python, R, Presto — can read directly from it. You can store raw logs, images, JSON, Parquet, and CSV all in the same bucket. The catch: no transactions, no schema enforcement, no versioning. Two writers touching the same dataset simultaneously will corrupt it. Queries are slow because nothing knows which files are relevant. Deleting specific rows requires rewriting entire files.

The lakehouse architecture adds a thin transactional metadata layer on top of cheap object storage, giving you warehouse-like reliability and performance while keeping lake-like openness and economics. It's not a new storage system — it's a new way of organizing and tracking what's already in your object storage.

The Three Components

Every lakehouse has three distinct layers. Understanding what each one does makes the whole system click.

1. Object Storage — The Foundation

Your actual data lives in object storage (S3, GCS, or Azure Blob) as columnar files, almost always Apache Parquet. Parquet stores data column-by-column rather than row-by-row, which means a query asking for three columns out of fifty only reads those three — dramatically reducing I/O. It also compresses very efficiently because similar values end up adjacent.
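You can see this behavior directly with DuckDB, one of the engines mentioned later: a query that names three columns reads only those column chunks from the Parquet files, no matter how wide the files are. The path glob and column names here are hypothetical.

-- DuckDB reads only the referenced columns from the Parquet files
-- (illustrative path and column names)
SELECT user_id, event_type, event_time
FROM read_parquet('events/2026-04-*.parquet');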

Object storage is append-only by design. You can write new files and delete old ones, but you can't edit a file in place. This immutability is actually what makes lakehouses possible: since files never change, readers can safely cache them and writers can work with new files without disturbing existing readers.

2. The Table Format — The Brains

The table format is what turns a bucket of Parquet files into something that behaves like a database table. It's a metadata layer — a set of files alongside your data files — that tracks:

  • Which Parquet files currently belong to this table (the "snapshot")
  • The schema (column names, types, nullability)
  • Partition information (which files contain which date ranges, regions, etc.)
  • A transaction log of every change ever made
  • Statistics per file (min/max values per column, row counts)
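Iceberg exposes this metadata as queryable tables that sit alongside the data. A quick way to see what the format is tracking, assuming a hypothetical table db.orders registered in the session's current Iceberg catalog (Spark SQL shown):

-- One row per committed change to the table
SELECT snapshot_id, committed_at, operation FROM db.orders.snapshots;

-- The data files in the current snapshot, with per-file statistics
SELECT file_path, record_count, file_size_in_bytes FROM db.orders.files;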

The three dominant table formats are Apache Iceberg, Delta Lake, and Apache Hudi. Iceberg has become the de facto open standard — it's supported natively by Snowflake, BigQuery, Athena, Redshift, Spark, Trino, Flink, and DuckDB. Delta Lake was created by Databricks and remains tightly integrated with their platform. Hudi was built at Uber and excels at streaming upserts but has a steeper learning curve.

For new projects in 2026, Iceberg is the default choice unless you're already deep in the Databricks ecosystem.

3. The Query Engine — The Worker

The query engine reads the table format metadata, figures out which files it needs, and executes your SQL or DataFrame operation. The same Iceberg table can be queried by multiple engines simultaneously: Spark for heavy batch jobs, Trino for ad-hoc SQL, DuckDB for local development, and your data warehouse (Snowflake, BigQuery) via their native Iceberg connectors.

This multi-engine access is the key practical advantage. Your data science team can use Spark for model training. Your analysts can use Snowflake for dashboards. Your engineers can use Trino for data pipelines. They're all reading the same table, with the same schema, seeing the same consistent state.

How ACID Transactions Work Without a Database

This is the part most explanations skip, and it's the key to understanding why lakehouses are reliable.

Traditional databases use locks: writer A acquires a lock, makes changes, releases the lock. This works when a single database process controls the storage. It doesn't work with object storage: S3 has no concept of locks, and multiple independent compute clusters can write to the same bucket simultaneously.

Iceberg solves this with optimistic concurrency control and atomic commits. Here's the sequence for a write operation:

  1. The writer reads the current snapshot (a metadata file listing all active data files)
  2. It writes new Parquet data files to object storage
  3. It writes a new metadata file describing the updated state (new snapshot + new files)
  4. It attempts to atomically swap the "current snapshot" pointer to the new metadata file
  5. If another writer committed first (the pointer changed), this writer detects the conflict, discards its metadata file, and retries from step 1

The atomic swap in step 4 is implemented using a catalog (AWS Glue, Nessie, Hive Metastore, or a REST catalog) that supports compare-and-swap operations. This gives you serializable isolation — readers always see a consistent snapshot, and conflicting writes fail fast rather than silently corrupting data.
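You rarely touch this machinery directly, but Iceberg does let you tune the retry behavior in step 5 through table properties. A minimal sketch, using property names from Iceberg's documentation and a placeholder table name:

-- Allow more commit retries before a concurrent write gives up
ALTER TABLE db.orders SET TBLPROPERTIES (
  'commit.retry.num-retries' = '10',
  'commit.retry.min-wait-ms' = '100'
);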

Time Travel and Why It Matters

Because Iceberg keeps the full transaction log, you can query any previous snapshot of your table:

-- Query the table as it existed yesterday
SELECT * FROM orders FOR SYSTEM_TIME AS OF TIMESTAMP '2026-04-18 00:00:00';

-- Query a specific snapshot by ID
SELECT * FROM orders VERSION AS OF 8274658234987;

This has practical consequences beyond debugging. You can reproduce ML training datasets exactly as they were on a specific date. You can audit what data a regulatory report was based on. You can roll back a bad write without restoring from backup — just point the current snapshot pointer back to a previous one.
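Rolling back is a metadata-only operation. With Spark and Iceberg's built-in procedures it looks like this (the catalog and table names are placeholders; the snapshot ID is the one from the example above):

-- Point the table back at an earlier snapshot; no data files are rewritten
CALL my_catalog.system.rollback_to_snapshot('db.orders', 8274658234987);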

Snapshots accumulate over time. The EXPIRE SNAPSHOTS procedure (run periodically via a scheduled job) removes old snapshots and their associated data files, reclaiming storage.
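In Spark, expiry is a one-line procedure call; something like the following, with placeholder names and a retention policy you'd tune to your own audit requirements:

-- Drop snapshots committed before the given timestamp, but keep the most recent 50
CALL my_catalog.system.expire_snapshots(
  table => 'db.orders',
  older_than => TIMESTAMP '2026-03-20 00:00:00',
  retain_last => 50
);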

Partition Pruning and File Statistics

Query performance in a lakehouse depends heavily on how well the query engine can skip irrelevant files. Iceberg enables this through two mechanisms.

Partition pruning: When you partition a table by date, Iceberg records which files contain which date ranges in the metadata. A query with WHERE event_date = '2026-04-19' only reads files from that partition — it never touches data from other dates.
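Concretely, the partitioning scheme is declared when the table is created, and thanks to Iceberg's hidden partitioning the query filters on the column itself rather than on a derived partition column. A sketch in Spark SQL, with illustrative table and column names:

-- Partition by day; Iceberg records which files belong to which day in metadata
CREATE TABLE db.events (
  event_id   BIGINT,
  event_date DATE,
  payload    STRING
) USING iceberg
PARTITIONED BY (days(event_date));

-- Only files in the 2026-04-19 partition are opened
SELECT count(*) FROM db.events WHERE event_date = DATE '2026-04-19';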

Column statistics: Iceberg stores per-file min/max values and null counts for every column. A query with WHERE user_id = 12345 on an unpartitioned table can still skip most files if 12345 falls outside the min/max range recorded for a file.

Together, these let a well-organized Iceberg table approach data warehouse query speeds on analytical workloads, without the proprietary storage format or per-query cost model.

Schema Evolution Without Breaking Downstream Consumers

One persistent headache with raw data lakes is schema drift — someone adds a column to a Kafka topic, and suddenly half your Spark jobs fail. Iceberg handles schema changes transactionally.

You can add columns, drop columns, rename columns, reorder columns, and widen types (e.g., int to long) — all without rewriting existing data files and without breaking readers that haven't been updated yet. Old readers simply ignore columns they don't recognize. New readers see the full schema.
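In Spark SQL (with Iceberg's SQL extensions enabled), these are plain DDL statements, and each one is a metadata-only commit. Table and column names below are illustrative:

ALTER TABLE db.orders ADD COLUMNS (discount_pct DOUBLE);      -- add a column
ALTER TABLE db.orders RENAME COLUMN cust_ref TO customer_id;  -- rename
ALTER TABLE db.orders ALTER COLUMN quantity TYPE BIGINT;      -- widen int to long
ALTER TABLE db.orders DROP COLUMN legacy_flag;                -- drop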

The schema is versioned in the metadata, so you can inspect the exact schema that was in place at any point in history — useful for debugging pipelines that break after a schema change.

When to Use a Lakehouse

A lakehouse is the right choice when you have several of these conditions:

  • Scale: You're storing hundreds of gigabytes to petabytes. Below ~100 GB, DuckDB reading Parquet files directly is simpler and often faster.
  • Mixed workloads: Your data serves both SQL analytics and Python/ML workloads. If it's purely SQL analytics, a managed warehouse is simpler to operate.
  • Multiple teams and tools: Different teams use different engines (Spark, Trino, Snowflake). You want them reading the same data without copies or complex ETL between systems.
  • Mutable data: You need to update or delete rows — GDPR erasure requests, correcting errors, late-arriving records that need upserts. Raw data lakes can't do this cleanly (see the example after this list).
  • Audit and reproducibility requirements: You need to prove what your data looked like at a specific point in time.
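The mutable-data point deserves a concrete shape. On an Iceberg table, row-level deletes and upserts are ordinary SQL; a sketch in Spark SQL, assuming a hypothetical db.orders table and a staging table of late-arriving records:

-- Erase one customer's rows (e.g. a GDPR request); Iceberg handles the file rewrites
DELETE FROM db.orders WHERE customer_id = 12345;

-- Upsert late-arriving records from a staging table
MERGE INTO db.orders t
USING db.orders_updates u
  ON t.order_id = u.order_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;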

When Not to Use a Lakehouse

The lakehouse model adds operational complexity. Don't default to it when simpler options fit.

Small datasets: If your data fits in a single Postgres instance or a DuckDB file, a lakehouse adds coordination overhead for no benefit. A table format managing fifty Parquet files in S3 is overkill when a single 2 GB Parquet file and a cron job would work.

Pure operational workloads: If you're running an application database — serving user requests, processing transactions in real time — use a proper OLTP database (Postgres, MySQL, CockroachDB). Lakehouses are designed for analytical reads, not millisecond operational queries.

Simple BI only: If your entire use case is dashboards and SQL reports, a managed warehouse (Snowflake, BigQuery, Redshift) is operationally simpler. You don't manage compaction, snapshot expiry, or catalog infrastructure. The query optimizer is purpose-built and the cost model is predictable.

Streaming-first workloads: If your primary requirement is real-time analytics at sub-second latency, use Apache Flink with Kafka or Apache Pinot for the hot layer, with a lakehouse backing the historical analysis tier.

The Minimal Production Stack

If you're starting fresh, here's the simplest production-ready setup:

  • Storage: S3 (or GCS / Azure Blob)
  • Table format: Apache Iceberg
  • Catalog: AWS Glue (if on AWS) or a self-hosted REST catalog (Apache Polaris or Nessie)
  • Batch processing: Apache Spark on EMR, Dataproc, or Databricks
  • Ad-hoc SQL: Athena (serverless, reads Iceberg natively) or Trino
  • BI / dashboards: Connect Snowflake or BigQuery to your Iceberg tables via their native Iceberg reader integrations
  • Orchestration: Apache Airflow or Prefect to schedule compaction, snapshot expiry, and ingestion pipelines

You don't need all of this on day one. Start with S3 + Iceberg + Spark for writes + Athena for reads. Add components as actual workloads demand them.
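On AWS, the very first table can be created straight from Athena, which can both read and write Iceberg. A hedged sketch — database, bucket, and column names are placeholders, and the full set of supported DDL options is in the Athena documentation:

CREATE TABLE my_db.orders (
  order_id    bigint,
  customer_id bigint,
  order_ts    timestamp,
  amount      double
)
PARTITIONED BY (day(order_ts))
LOCATION 's3://my-bucket/warehouse/orders/'
TBLPROPERTIES ('table_type' = 'ICEBERG');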

The One Operational Task You Cannot Skip

Compaction. Over time, incremental writes produce many small Parquet files. A table with millions of 10 KB files will be slow to query because every query must list and open huge numbers of tiny files. The rewrite_data_files procedure (Iceberg's term) combines small files into larger ones — typically targeting 128–512 MB per file — without downtime and without data loss.
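In Spark, the call looks roughly like this (catalog and table names are placeholders; the target file size is passed as an option, here 512 MB):

-- Rewrite small files into ~512 MB files; runs without blocking readers or writers
CALL my_catalog.system.rewrite_data_files(
  table => 'db.orders',
  options => map('target-file-size-bytes', '536870912')
);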

Schedule compaction during off-peak hours, daily or weekly depending on your write volume. Most managed platforms (Databricks, AWS Lake Formation) handle this automatically. If you're self-managing, wire it into your orchestration pipeline from the start — not as an afterthought when query performance degrades six months later.

Key Takeaways

A data lakehouse is object storage plus a table format (Iceberg, Delta Lake, or Hudi) that adds ACID transactions, schema management, time travel, and partition statistics on top. The query engine reads the metadata to skip irrelevant files and execute efficiently. Multiple engines can read the same table simultaneously without coordination overhead.

Use it when you have scale, mixed workloads, multiple teams, or mutable data. Skip it when your data is small, your workload is purely operational, or SQL analytics is your only use case.

The technology has matured significantly. Iceberg is now supported natively by every major cloud warehouse and query engine. The main cost of a lakehouse in 2026 is not the tooling — it's the operational discipline: managing compaction, snapshot expiry, and catalog infrastructure. Plan for that upfront, and the architecture rewards you with flexibility and cost efficiency that no proprietary warehouse can match at scale.