Zero-ETL Architecture: When Data Does Not Move
The pipeline sprawl that consumed data engineering for a decade is collapsing. Databricks, Snowflake, and AWS now ship direct-integration features that replicate operational data into analytics platforms with no custom code. The question is no longer whether zero-ETL works — it is where it breaks.
The Pipeline Problem
For five years, data engineering meant pipeline operations. Teams built Airflow DAGs, maintained Spark jobs, and debugged brittle connectors that broke dashboards at inconvenient moments. The discipline earned credibility from operational output — getting data to move — rather than architectural design. The resulting stacks worked, but they created a structural dependency: every new source required a new pipeline, every schema change triggered a new incident, and every pipeline became a liability that no one wanted to own.
Data no longer needs to travel to be useful. Open table formats and engine-agnostic querying have removed the structural constraint that made pipelines the default.
That constraint is disappearing. Cheap object storage, open table formats like Apache Iceberg and Delta Lake, and decoupled compute engines mean data can stay where it is and still serve analytics. The "move everything to a warehouse" pattern is no longer the only option — and for many workloads, it is not the best one.
What Zero-ETL Actually Is
Zero-ETL ingests data into an analytics or AI platform without custom ETL pipelines. The flow changes as follows:
| Traditional ETL | Zero-ETL |
|---|---|
| Source → Extract → Transform → Stage → Load → Warehouse | Source → Managed Integration → Warehouse / Lakehouse |
| Custom code for each source | Platform-managed connectors and CDC |
| Schema defined before loading (schema-on-write) | Schema applied at query time (schema-on-read) |
| Batch windows, scheduled refreshes | Near real-time, continuous replication |
| Transformation is a pre-load gate | Transformation is deferred to downstream models (dbt, materialized views) |
This is not "no data engineering." It is less custom ingestion code and more managed integration. Databricks Native Lakehouse Sync replicates Postgres data into Unity Catalog via CDC without external pipelines. Snowflake's zero-copy SAP integration joins SAP Business Data Cloud with non-SAP data at scale, no duplication required. AWS Glue zero-ETL streams DynamoDB changes directly into data lakes with schema and partition controls.
Zero-ETL removes plumbing, not responsibility. Bad source data arrives faster — which means bad dashboards and hallucinating AI systems arrive faster too.
The Three Architecture Patterns
Zero-ETL is not a single implementation. Three patterns have emerged, each with distinct trade-offs:
Pattern 1: Native Replication (Platform-Managed CDC)
The vendor provides a built-in connector that replicates data from an operational source into its analytics platform. Snowflake's SAP zero-copy connector, AWS Glue zero-ETL for DynamoDB, and Databricks Lakehouse Sync all follow this model. The platform handles schema evolution, change capture, and error recovery.
Where it works: SaaS-to-warehouse replication, operational dashboards, AI feature stores where freshness matters more than transformation complexity.
Where it breaks: Multi-source joins that need pre-computed business logic. Financial reporting requiring validated, auditable transformations. Any environment where the source schema changes frequently and downstream consumers are not prepared for instant propagation.
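To make Pattern 1 concrete from the consumer's side, here is a minimal sketch, assuming a hypothetical lakehouse table that a managed CDC integration keeps fresh; the catalog, schema, table, and column names are placeholders rather than any platform's defaults.

```python
# Minimal sketch: consuming a table that platform-managed CDC keeps in sync.
# Catalog, schema, table, and column names are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# No DAG and no connector code: the replicated table reads like any other
# lakehouse table, with change capture handled by the platform upstream.
orders = spark.table("main.operational_sync.orders")

daily_revenue = (
    orders
    .where(F.col("order_status") == "completed")
    .groupBy("order_date")
    .agg(F.sum("order_total").alias("revenue"))
)
daily_revenue.show()
```

What matters is what is absent: there is no ingestion job to schedule or debug. The trade-off, covered below, is that source schema changes arrive here unannounced.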
Pattern 2: Federated Querying (Query-Through)
The analytics engine queries external data sources directly without moving data. Snowflake's external tables, BigQuery's federated queries, and Memgraph Zero's federated graph engine all query data where it lives. The engine handles translation, but no data is persisted locally.
Where it works: Ad-hoc analysis, data exploration, joining infrequently queried external datasets. AI agents that need immediate context from multiple sources without waiting for replication lag.
Where it breaks: High-frequency queries on large external datasets — latency and cost scale quickly. Complex transformations that need intermediate materialization. Compliance requirements for data residency that conflict with where the source lives.
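A sketch of the query-through model, assuming a Snowflake-style external table over files in object storage; the account details, table, and column names are placeholders, and the same shape applies to BigQuery federated queries.

```python
# Sketch of federated querying: the engine scans the external source at query
# time instead of loading it first. Account, table, and column names are
# illustrative placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account",
    user="analyst",
    warehouse="ANALYTICS_WH",
    authenticator="externalbrowser",
)
cur = conn.cursor()

# ext_clickstream is an external table over object storage: every run scans
# the source, so latency and cost track query frequency, not a one-time load.
cur.execute("""
    SELECT c.customer_id, COUNT(*) AS events
    FROM analytics.public.customers c
    JOIN analytics.public.ext_clickstream e
      ON c.customer_id = e.customer_id
    GROUP BY c.customer_id
    ORDER BY events DESC
    LIMIT 20
""")
for row in cur.fetchall():
    print(row)
cur.close()
conn.close()
```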
Pattern 3: Data Sharing and Marketplace
Data providers publish curated datasets that consumers mount directly into their own environment. Snowflake Data Sharing, Databricks Delta Sharing, and AWS Data Exchange enable this pattern. No data copying, no pipelines — consumers query the provider's data in their own account.
Where it works: Consuming third-party reference data (financial, geospatial, demographic). Internal data mesh patterns between business units. Sharing curated data products across organizational boundaries.
Where it breaks: Customization — consumers get the data in the provider's schema and format, not their own. Governance — the provider controls freshness and availability. Cost — query and storage costs accrue in the consumer's account on data they do not own.
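For the consumer side of Pattern 3, a minimal sketch using Snowflake's share-mounting syntax as the example; the provider, share, database, and table names are placeholders.

```python
# Sketch: mounting a provider's share in the consumer account and querying it
# in place. Provider, share, database, and table names are placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="consumer_acct",
    user="analyst",
    warehouse="ANALYTICS_WH",
    authenticator="externalbrowser",
)
cur = conn.cursor()

# One-time setup: create a local database backed by the provider's share.
# Nothing is copied, and the provider retains control of schema and freshness.
cur.execute("CREATE DATABASE market_reference FROM SHARE provider_org.market_share")

# From here on, the shared data is queried like a local table, with query
# compute billed to the consumer account.
cur.execute(
    "SELECT * FROM market_reference.public.fx_rates ORDER BY quote_date DESC LIMIT 10"
)
print(cur.fetchall())
cur.close()
conn.close()
```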
The Hidden Costs Nobody Mentions
Schema Drift Hits Faster
When schemas change upstream, they propagate instantly. A column rename in the operational database breaks dashboard queries in minutes, not hours. Under traditional ETL, the pipeline layer acted as a buffer — transformations could absorb schema changes before they reached analytics. Zero-ETL removes that buffer entirely.
The mitigation is data contracts: versioned schema agreements between source and consumer teams, enforced at the integration layer rather than the transformation layer. But data contracts require organizational discipline that most teams have not yet built.
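What enforcement at the integration layer can look like in practice: a minimal sketch of a contract check that runs before downstream models consume a replicated table. The contract format and column names are illustrative, not any specific tool's API.

```python
# Minimal sketch of a data contract check: compare a replicated table's live
# schema against the versioned schema agreed with the source team.
# Contract contents and column names are illustrative assumptions.
EXPECTED_SCHEMA = {
    "order_id": "string",
    "customer_id": "string",
    "order_total": "decimal(10,2)",
    "order_date": "date",
}

def contract_violations(actual: dict[str, str], expected: dict[str, str]) -> list[str]:
    """Return human-readable violations between the live schema and the contract."""
    violations = []
    for column, dtype in expected.items():
        if column not in actual:
            violations.append(f"missing column: {column}")
        elif actual[column] != dtype:
            violations.append(f"type changed: {column} expected {dtype}, got {actual[column]}")
    for column in sorted(actual.keys() - expected.keys()):
        violations.append(f"unannounced new column: {column}")
    return violations

# Example: a column renamed upstream surfaces here instead of in a dashboard.
live_schema = {
    "order_id": "string",
    "customer_ref": "string",
    "order_total": "decimal(10,2)",
    "order_date": "date",
}
for issue in contract_violations(live_schema, EXPECTED_SCHEMA):
    print(issue)
```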
Vendor Lock-In, by Design
Native replication connectors are proprietary. Snowflake's SAP integration, Databricks Lakehouse Sync, and AWS Glue zero-ETL each tie data movement to their own platform. The connector works because it is deeply integrated with a specific storage format, transaction model, and compute engine. Migrating later means rebuilding the entire integration from scratch.
The teams that win will design platforms that serve BI and AI, invest in data quality and observability, use dbt for semantics, and combine zero-ETL for speed with transformation for trust.
Transformation Does Not Disappear
Zero-ETL solutions focus on replication, not business logic. Complex transformations — slowly changing dimensions, multi-source joins, deduplication, and enrichment — still belong in dbt models, materialized views, or Spark jobs downstream. The pipeline count decreases, but the modeling work moves to a different layer of the stack. Teams that skip this step get faster access to raw data they cannot trust.
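As an illustration of where that modeling work lands, a sketch of a downstream deduplication step over a replicated raw table; the table names and the updated_at column are assumptions, and the same logic could just as well live in a dbt model or materialized view.

```python
# Sketch: deduplication still happens, just downstream of the managed
# integration. Table names and the updated_at column are assumptions.
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Raw table landed by the zero-ETL integration; continuous replication means
# multiple versions of the same record are expected at this layer.
raw = spark.table("raw.crm.contacts")

# Keep only the most recent record per natural key.
latest_first = Window.partitionBy("contact_id").orderBy(F.col("updated_at").desc())
current = (
    raw.withColumn("rn", F.row_number().over(latest_first))
       .where(F.col("rn") == 1)
       .drop("rn")
)
current.write.mode("overwrite").saveAsTable("staging.crm.contacts_current")
```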
Cost Surprises
"Zero ETL" does not mean zero billing. Connector pricing, warehouse compute for on-demand transformations, and storage for replicated data all carry costs that accumulate differently than traditional pipeline infrastructure. A team that replaces 50 Airflow DAGs with native connectors may find the total cost similar — it just shifts from engineering time to platform fees.
Data Quality Is Still Your Problem
This is the critical blind spot. Zero-ETL accelerates bad data as efficiently as good data. Stale fields propagate faster. PII leaks into analytics with no transformation layer to strip it. Inconsistent definitions reach dashboards without the cleansing step that previously normalized them. As covered in earlier reporting on data observability, the pipeline layer historically provided a quality checkpoint — remove it, and quality must be enforced elsewhere.
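Enforcing quality elsewhere usually means checks like the following, run against the replicated tables themselves; the thresholds and example values are assumptions, not recommendations.

```python
# Sketch of the checks that replace the pipeline-layer quality gate:
# freshness and volume monitoring on the replicated tables themselves.
# Thresholds and example values are illustrative assumptions.
from datetime import datetime, timedelta, timezone

def is_fresh(last_record_at: datetime, max_lag: timedelta = timedelta(minutes=15)) -> bool:
    """True if the newest replicated record is within the allowed lag."""
    return datetime.now(timezone.utc) - last_record_at <= max_lag

def volume_ok(todays_rows: int, trailing_avg: float, tolerance: float = 0.5) -> bool:
    """True if today's row count is within the tolerance band around the trailing average."""
    return abs(todays_rows - trailing_avg) <= tolerance * trailing_avg

# Example: a sync that silently stalled two hours ago fails the freshness check.
print(is_fresh(datetime.now(timezone.utc) - timedelta(hours=2)))   # False
print(volume_ok(todays_rows=48_000, trailing_avg=50_000.0))        # True
```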
Where Zero-ETL Shines
| Use Case | Pattern | Works Well | Watch Out For |
|---|---|---|---|
| AI feature ingestion | Native replication | Fresh data for models, low latency | PII propagation, schema drift in feature stores |
| Operational dashboards | Native replication | Near real-time metrics, no Airflow jobs | Cost escalates with query frequency |
| SaaS tool data sync | Native replication | CRM → warehouse in minutes | Vendor-specific schema limits customization |
| Ad-hoc data exploration | Federated querying | Query anything, no setup | Latency and cost on large datasets |
| Third-party reference data | Data sharing | Instant access to curated datasets | Provider controls schema and freshness |
| Financial reporting | Traditional ETL | Audit trail, validated transforms | — |
| Compliance pipelines | Traditional ETL | PII stripping, lineage tracking | — |
The rows with "Watch Out For" entries are where zero-ETL delivers speed but requires governance investment to deliver trust. The two bottom rows — financial reporting and compliance — are still better served by traditional ETL precisely because the transformation layer provides the audit trail and validation that regulations demand.
The Data Engineer Role Shift
Zero-ETL does not eliminate data engineering. It changes the job description. The focus moves from building pipelines to designing platforms:
| Old Focus | New Focus |
|---|---|
| Writing ingestion connectors | Evaluating platform-native integrations |
| Debugging Airflow DAGs | Managing data quality contracts and observability |
| Maintaining transformation pipelines | Building semantic layers with dbt |
| Schema migration scripts | Schema evolution policies and consumer notification |
| Capacity planning for batch windows | Cost optimization for continuous replication |
The discipline shifts from pipeline builder to data platform architect. The skills required — contract negotiation, cost modeling, semantic layer design — are different from orchestration and connector debugging, but they are no less engineering.
This shift is already visible in Databricks positioning Lakehouse Sync as an "activation" feature, not a replication tool. As the agentic data engineering model gains traction, the value proposition is clear: data engineers should own the platform's semantics, cost model, and quality guarantees — not the plumbing that moves bytes between systems.
Exceptions and Limits
Zero-ETL is not universally applicable. Teams working in regulated industries — financial services, healthcare, government — often need the transformation layer for compliance reasons. PII must be stripped before data reaches analytics. Financial figures must be validated through reconciliation. Audit trails require lineage tracking that pipeline code provides inherently.
Organizations with extensive dbt deployments face a different limit. If 80% of analytics value comes from transformed models — slowly changing dimensions, business logic aggregations, cross-source joins — then zero-ETL acceleration of the raw layer does not address the bottleneck. The constraint is transformation complexity, not ingestion speed.
Multi-cloud environments present a third limit. Native replication works within a vendor's ecosystem. Replicating from AWS DynamoDB to GCP BigQuery, or from Azure Cosmos DB to Snowflake on AWS, requires exactly the kind of cross-platform integration that zero-ETL is designed to eliminate — but the connectors are not there yet.
Honest Assessment
| Dimension | Zero-ETL | Traditional ETL |
|---|---|---|
| Time to first dashboard | Hours to days | Weeks to months |
| Operational overhead | Low (managed connectors) | High (custom pipelines) |
| Schema flexibility | High (schema-on-read) | Low (schema-on-write gate) |
| Data quality controls | Must be built separately | Built into transformation |
| Vendor portability | Low (proprietary connectors) | Higher (standards-based) |
| Cost predictability | Variable (compute + replication) | More predictable (scheduled batches) |
| Audit trail | Minimal (raw replication) | Strong (transformation logs) |
| Transformation depth | Shallow (deferred to dbt) | Deep (in-pipeline logic) |
Actionable Takeaways
- Audit your pipeline portfolio. Classify every pipeline by transformation complexity. If 60% or more are simple replication (no business logic, just type casting and deduplication), zero-ETL can replace them; a classification sketch follows this list.
- Start with SaaS replication, not core data. CRM, marketing, and support tool connectors are low-risk entry points. Core financial and compliance data should stay on traditional ETL until governance catches up.
- Invest in data contracts before migrating. Versioned schema agreements between source and consumer teams prevent schema-drift incidents. Without them, zero-ETL's instant propagation becomes a liability.
- Budget for platform cost modeling. Connector fees, continuous replication compute, and on-demand transformation costs accumulate differently than batch infrastructure. Run a cost comparison before committing.
- Layer observability on top immediately. Data observability is not optional when pipelines disappear. Freshness checks, volume anomaly detection, and schema drift alerts must cover the integration points that zero-ETL replaces.
- Keep dbt for semantics. Zero-ETL moves data faster. dbt models still define what that data means. The winning pattern is fast ingestion plus trusted transformation — not one without the other.
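For the first takeaway, a minimal sketch of what the portfolio classification can look like; the 60% threshold comes from the takeaway above, while the metadata fields and example pipelines are illustrative assumptions.

```python
# Sketch of a pipeline-portfolio audit: classify each pipeline by whether it
# carries business logic, then measure the share that is simple replication.
# The metadata fields and example entries are illustrative assumptions.
PIPELINES = [
    {"name": "crm_contacts_sync",    "has_business_logic": False, "transforms": ["cast", "dedupe"]},
    {"name": "billing_revenue_agg",  "has_business_logic": True,  "transforms": ["scd2", "multi_source_join"]},
    {"name": "support_tickets_sync", "has_business_logic": False, "transforms": ["cast"]},
]

def is_simple_replication(pipeline: dict) -> bool:
    """Simple replication: no business logic, only trivial per-row transforms."""
    trivial = {"cast", "dedupe", "rename"}
    return not pipeline["has_business_logic"] and set(pipeline["transforms"]) <= trivial

candidates = [p["name"] for p in PIPELINES if is_simple_replication(p)]
share = len(candidates) / len(PIPELINES)

print(f"simple replication candidates: {candidates}")
print(f"share of portfolio: {share:.0%} (worth evaluating zero-ETL above roughly 60%)")
```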