Zero-ETL Architecture: When Data Does Not Move
The pipeline sprawl that consumed data engineering for a decade is collapsing. Databricks, Snowflake, and AWS now ship direct-integration features that replicate operational data into analytics platforms with no custom code. The question is no longer whether zero-ETL works — it is where it breaks.
The Pipeline Problem
For five years, data engineering meant pipeline operations. Teams built Airflow DAGs, maintained Spark jobs, and debugged brittle connectors that broke dashboards at inconvenient moments. The discipline earned credibility from operational output — getting data to move — rather than architectural design. The resulting stacks worked, but they created a structural dependency: every new source required a new pipeline, every schema change triggered a new incident, and every pipeline became a liability that no one wanted to own.
Data no longer needs to travel to be useful. Open table formats and engine-agnostic querying have removed the structural constraint that made pipelines the default.
That constraint is disappearing. Cheap object storage, open table formats like Apache Iceberg and Delta Lake, and decoupled compute engines mean data can stay where it is and still serve analytics. The "move everything to a warehouse" pattern is no longer the only option — and for many workloads, it is not the best one.
What Zero-ETL Actually Is
Zero-ETL ingests data into an analytics or AI platform without custom ETL pipelines. The flow changes as follows:
| Traditional ETL | Zero-ETL |
|---|---|
| Source → Extract → Transform → Stage → Load → Warehouse | Source → Managed Integration → Warehouse / Lakehouse |
| Custom code for each source | Platform-managed connectors and CDC |
| Schema defined before loading (schema-on-write) | Schema applied at query time (schema-on-read) |
| Batch windows, scheduled refreshes | Near real-time, continuous replication |
| Transformation is a pre-load gate | Transformation is deferred to downstream models (dbt, materialized views) |
This is not "no data engineering." It is less custom ingestion code and more managed integration. Databricks Native Lakehouse Sync replicates Postgres data into Unity Catalog via CDC without external pipelines. Snowflake's zero-copy SAP integration joins SAP Business Data Cloud with non-SAP data at scale, no duplication required. AWS Glue zero-ETL streams DynamoDB changes directly into data lakes with schema and partition controls.
Zero-ETL removes plumbing, not responsibility. Bad source data arrives faster — which means bad dashboards and hallucinating AI systems arrive faster too.
The Three Architecture Patterns
Zero-ETL is not a single implementation. Three patterns have emerged, each with distinct trade-offs:
Pattern 1: Native Replication (Platform-Managed CDC)
The vendor provides a built-in connector that replicates data from an operational source into its analytics platform. Snowflake's SAP zero-copy connector, AWS Glue zero-ETL for DynamoDB, and Databricks Lakehouse Sync all follow this model. The platform handles schema evolution, change capture, and error recovery.
Where it works: SaaS-to-warehouse replication, operational dashboards, AI feature stores where freshness matters more than transformation complexity.
Where it breaks: Multi-source joins that need pre-computed business logic. Financial reporting requiring validated, auditable transformations. Any environment where the source schema changes frequently and downstream consumers are not prepared for instant propagation.
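To make Pattern 1 concrete from the consumer's side, here is a minimal sketch, assuming a hypothetical lakehouse table that a managed CDC integration keeps fresh; the catalog, schema, table, and column names are placeholders rather than any platform's defaults.

```python
# Minimal sketch: consuming a table that platform-managed CDC keeps in sync.
# Catalog, schema, table, and column names are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# No DAG and no connector code: the replicated table reads like any other
# lakehouse table, with change capture handled by the platform upstream.
orders = spark.table("main.operational_sync.orders")

daily_revenue = (
    orders
    .where(F.col("order_status") == "completed")
    .groupBy("order_date")
    .agg(F.sum("order_total").alias("revenue"))
)
daily_revenue.show()
```

What matters is what is absent: there is no ingestion job to schedule or debug. The trade-off, covered below, is that source schema changes arrive here unannounced.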
Pattern 2: Federated Querying (Query-Through)
The analytics engine queries external data sources directly without moving data. Snowflake's external tables, BigQuery's federated queries, and Memgraph Zero's federated graph engine all query data where it lives. The engine handles translation, but no data is persisted locally.
Where it works: Ad-hoc analysis, data exploration, joining infrequently queried external datasets. AI agents that need immediate context from multiple sources without waiting for replication lag.
Where it breaks: High-frequency queries on large external datasets — latency and cost scale quickly. Complex transformations that need intermediate materialization. Compliance requirements for data residency that conflict with where the source lives.
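A sketch of the query-through model, assuming a Snowflake-style external table over files in object storage; the account details, table, and column names are placeholders, and the same shape applies to BigQuery federated queries.

```python
# Sketch of federated querying: the engine scans the external source at query
# time instead of loading it first. Account, table, and column names are
# illustrative placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account",
    user="analyst",
    warehouse="ANALYTICS_WH",
    authenticator="externalbrowser",
)
cur = conn.cursor()

# ext_clickstream is an external table over object storage: every run scans
# the source, so latency and cost track query frequency, not a one-time load.
cur.execute("""
    SELECT c.customer_id, COUNT(*) AS events
    FROM analytics.public.customers c
    JOIN analytics.public.ext_clickstream e
      ON c.customer_id = e.customer_id
    GROUP BY c.customer_id
    ORDER BY events DESC
    LIMIT 20
""")
for row in cur.fetchall():
    print(row)
cur.close()
conn.close()
```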
Pattern 3: Data Sharing and Marketplace
Data providers publish curated datasets that consumers mount directly into their own environment. Snowflake Data Sharing, Databricks Delta Sharing, and AWS Data Exchange enable this pattern. No data copying, no pipelines — consumers query the provider's data in their own account.
Where it works: Consuming third-party reference data (financial, geospatial, demographic). Internal data mesh patterns between business units. Sharing curated data products across organizational boundaries.
Where it breaks: Customization — consumers get the data in the provider's schema and format, not their own. Governance — the provider controls freshness and availability. Cost — query and storage costs accrue in the consumer's account on data they do not own.
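For the consumer side of Pattern 3, a minimal sketch using Snowflake's share-mounting syntax as the example; the provider, share, database, and table names are placeholders.

```python
# Sketch: mounting a provider's share in the consumer account and querying it
# in place. Provider, share, database, and table names are placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="consumer_acct",
    user="analyst",
    warehouse="ANALYTICS_WH",
    authenticator="externalbrowser",
)
cur = conn.cursor()

# One-time setup: create a local database backed by the provider's share.
# Nothing is copied, and the provider retains control of schema and freshness.
cur.execute("CREATE DATABASE market_reference FROM SHARE provider_org.market_share")

# From here on, the shared data is queried like a local table, with query
# compute billed to the consumer account.
cur.execute(
    "SELECT * FROM market_reference.public.fx_rates ORDER BY quote_date DESC LIMIT 10"
)
print(cur.fetchall())
cur.close()
conn.close()
```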
The Hidden Costs Nobody Mentions
Schema Drift Hits Faster
When schemas change upstream, they propagate instantly. A column rename in the operational database breaks dashboard queries in minutes, not hours. Under traditional ETL, the pipeline layer acted as a buffer — transformations could absorb schema changes before they reached analytics. Zero-ETL removes that buffer entirely.
The mitigation is data contracts: versioned schema agreements between source and consumer teams, enforced at the integration layer rather than the transformation layer. But data contracts require organizational discipline that most teams have not yet built.
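What enforcement at the integration layer can look like in practice: a minimal sketch of a contract check that runs before downstream models consume a replicated table. The contract format and column names are illustrative, not any specific tool's API.

```python
# Minimal sketch of a data contract check: compare a replicated table's live
# schema against the versioned schema agreed with the source team.
# Contract contents and column names are illustrative assumptions.
EXPECTED_SCHEMA = {
    "order_id": "string",
    "customer_id": "string",
    "order_total": "decimal(10,2)",
    "order_date": "date",
}

def contract_violations(actual: dict[str, str], expected: dict[str, str]) -> list[str]:
    """Return human-readable violations between the live schema and the contract."""
    violations = []
    for column, dtype in expected.items():
        if column not in actual:
            violations.append(f"missing column: {column}")
        elif actual[column] != dtype:
            violations.append(f"type changed: {column} expected {dtype}, got {actual[column]}")
    for column in sorted(actual.keys() - expected.keys()):
        violations.append(f"unannounced new column: {column}")
    return violations

# Example: a column renamed upstream surfaces here instead of in a dashboard.
live_schema = {
    "order_id": "string",
    "customer_ref": "string",
    "order_total": "decimal(10,2)",
    "order_date": "date",
}
for issue in contract_violations(live_schema, EXPECTED_SCHEMA):
    print(issue)
```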
Vendor Lock-In, by Design
Native replication connectors are proprietary. Snowflake's SAP integration, Databricks Lakehouse Sync, and AWS Glue zero-ETL each tie data movement to their own platform. The connector works because it is deeply integrated with a specific storage format, transaction model, and compute engine. Migrating later means rebuilding the entire integration from scratch.
The teams that win will design platforms that serve BI and AI, invest in data quality and observability, use dbt for semantics, and combine zero-ETL for speed with transformation for trust.
Transformation Does Not Disappear
Zero-ETL solutions focus on replication, not business logic. Complex transformations — slowly changing dimensions, multi-source joins, deduplication, and enrichment — still belong in dbt models, materialized views, or Spark jobs downstream. The pipeline count decreases, but the modeling work moves to a different layer of the stack. Teams that skip this step get faster access to raw data they cannot trust.
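As an illustration of where that modeling work lands, a sketch of a downstream deduplication step over a replicated raw table; the table names and the updated_at column are assumptions, and the same logic could just as well live in a dbt model or materialized view.

```python
# Sketch: deduplication still happens, just downstream of the managed
# integration. Table names and the updated_at column are assumptions.
from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Raw table landed by the zero-ETL integration; continuous replication means
# multiple versions of the same record are expected at this layer.
raw = spark.table("raw.crm.contacts")

# Keep only the most recent record per natural key.
latest_first = Window.partitionBy("contact_id").orderBy(F.col("updated_at").desc())
current = (
    raw.withColumn("rn", F.row_number().over(latest_first))
       .where(F.col("rn") == 1)
       .drop("rn")
)
current.write.mode("overwrite").saveAsTable("staging.crm.contacts_current")
```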
Cost Surprises
"Zero ETL" does not mean zero billing. Connector pricing, warehouse compute for on-demand transformations, and storage for replicated data all carry costs that accumulate differently than traditional pipeline infrastructure. A team that replaces 50 Airflow DAGs with native connectors may find the total cost similar — it just shifts from engineering time to platform fees.
Data Quality Is Still Your Problem
This is the critical blind spot. Zero-ETL accelerates bad data as efficiently as good data. Stale fields propagate faster. PII leaks into analytics with no transformation layer to strip it. Inconsistent definitions reach dashboards without the cleansing step that previously normalized them. As covered in earlier reporting on data observability, the pipeline layer historically provided a quality checkpoint — remove it, and quality must be enforced elsewhere.
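Enforcing quality elsewhere usually means checks like the following, run against the replicated tables themselves; the thresholds and example values are assumptions, not recommendations.

```python
# Sketch of the checks that replace the pipeline-layer quality gate:
# freshness and volume monitoring on the replicated tables themselves.
# Thresholds and example values are illustrative assumptions.
from datetime import datetime, timedelta, timezone

def is_fresh(last_record_at: datetime, max_lag: timedelta = timedelta(minutes=15)) -> bool:
    """True if the newest replicated record is within the allowed lag."""
    return datetime.now(timezone.utc) - last_record_at <= max_lag

def volume_ok(todays_rows: int, trailing_avg: float, tolerance: float = 0.5) -> bool:
    """True if today's row count is within the tolerance band around the trailing average."""
    return abs(todays_rows - trailing_avg) <= tolerance * trailing_avg

# Example: a sync that silently stalled two hours ago fails the freshness check.
print(is_fresh(datetime.now(timezone.utc) - timedelta(hours=2)))   # False
print(volume_ok(todays_rows=48_000, trailing_avg=50_000.0))        # True
```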
Where Zero-ETL Shines
| Use Case | Pattern | Works Well | Watch Out For |
|---|---|---|---|
| AI feature ingestion | Native replication | Fresh data for models, low latency | PII propagation, schema drift in feature stores |
| Operational dashboards | Native replication | Near real-time metrics, no Airflow jobs | Cost escalates with query frequency |
| SaaS tool data sync | Native replication | CRM → warehouse in minutes | Vendor-specific schema limits customization |
| Ad-hoc data exploration | Federated querying | Query anything, no setup | Latency and cost on large datasets |
| Third-party reference data | Data sharing | Instant access to curated datasets | Provider controls schema and freshness |
| Financial reporting | Traditional ETL | Audit trail, validated transforms | — |
| Compliance pipelines | Traditional ETL | PII stripping, lineage tracking | — |
The rows with "Watch Out For" entries are where zero-ETL delivers speed but requires governance investment to deliver trust. The two bottom rows — financial reporting and compliance — are still better served by traditional ETL precisely because the transformation layer provides the audit trail and validation that regulations demand.
The Data Engineer Role Shift
Zero-ETL does not eliminate data engineering. It changes the job description. The focus moves from building pipelines to designing platforms:
| Old Focus | New Focus |
|---|---|
| Writing ingestion connectors | Evaluating platform-native integrations |
| Debugging Airflow DAGs | Managing data quality contracts and observability |
| Maintaining transformation pipelines | Building semantic layers with dbt |
| Schema migration scripts | Schema evolution policies and consumer notification |
| Capacity planning for batch windows | Cost optimization for continuous replication |
The discipline shifts from pipeline builder to data platform architect. The skills required — contract negotiation, cost modeling, semantic layer design — are different from orchestration and connector debugging, but they are no less engineering.
This shift is already visible in Databricks positioning Lakehouse Sync as an "activation" feature, not a replication tool. As the agentic data engineering model gains traction, the value proposition is clear: data engineers should own the platform's semantics, cost model, and quality guarantees — not the plumbing that moves bytes between systems.
Exceptions and Limits
Zero-ETL is not universally applicable. Teams working in regulated industries — financial services, healthcare, government — often need the transformation layer for compliance reasons. PII must be stripped before data reaches analytics. Financial figures must be validated through reconciliation. Audit trails require lineage tracking that pipeline code provides inherently.
Organizations with extensive dbt deployments face a different limit. If 80% of analytics value comes from transformed models — slowly changing dimensions, business logic aggregations, cross-source joins — then zero-ETL acceleration of the raw layer does not address the bottleneck. The constraint is transformation complexity, not ingestion speed.
Multi-cloud environments present a third limit. Native replication works within a vendor's ecosystem. Replicating from AWS DynamoDB to GCP BigQuery, or from Azure Cosmos DB to Snowflake on AWS, requires exactly the kind of cross-platform integration that zero-ETL is designed to eliminate — but the connectors are not there yet.
Honest Assessment
| Dimension | Zero-ETL | Traditional ETL |
|---|---|---|
| Time to first dashboard | Hours to days | Weeks to months |
| Operational overhead | Low (managed connectors) | High (custom pipelines) |
| Schema flexibility | High (schema-on-read) | Low (schema-on-write gate) |
| Data quality controls | Must be built separately | Built into transformation |
| Vendor portability | Low (proprietary connectors) | Higher (standards-based) |
| Cost predictability | Variable (compute + replication) | More predictable (scheduled batches) |
| Audit trail | Minimal (raw replication) | Strong (transformation logs) |
| Transformation depth | Shallow (deferred to dbt) | Deep (in-pipeline logic) |
Actionable Takeaways
- Audit your pipeline portfolio. Classify every pipeline by transformation complexity. If 60% or more are simple replication (no business logic, just type casting and deduplication), zero-ETL can replace them; a classification sketch follows this list.
- Start with SaaS replication, not core data. CRM, marketing, and support tool connectors are low-risk entry points. Core financial and compliance data should stay on traditional ETL until governance catches up.
- Invest in data contracts before migrating. Versioned schema agreements between source and consumer teams prevent schema-drift incidents. Without them, zero-ETL's instant propagation becomes a liability.
- Budget for platform cost modeling. Connector fees, continuous replication compute, and on-demand transformation costs accumulate differently than batch infrastructure. Run a cost comparison before committing.
- Layer observability on top immediately. Data observability is not optional when pipelines disappear. Freshness checks, volume anomaly detection, and schema drift alerts must cover the integration points that zero-ETL replaces.
- Keep dbt for semantics. Zero-ETL moves data faster. dbt models still define what that data means. The winning pattern is fast ingestion plus trusted transformation — not one without the other.
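For the first takeaway, a minimal sketch of what the portfolio classification can look like; the 60% threshold comes from the takeaway above, while the metadata fields and example pipelines are illustrative assumptions.

```python
# Sketch of a pipeline-portfolio audit: classify each pipeline by whether it
# carries business logic, then measure the share that is simple replication.
# The metadata fields and example entries are illustrative assumptions.
PIPELINES = [
    {"name": "crm_contacts_sync",    "has_business_logic": False, "transforms": ["cast", "dedupe"]},
    {"name": "billing_revenue_agg",  "has_business_logic": True,  "transforms": ["scd2", "multi_source_join"]},
    {"name": "support_tickets_sync", "has_business_logic": False, "transforms": ["cast"]},
]

def is_simple_replication(pipeline: dict) -> bool:
    """Simple replication: no business logic, only trivial per-row transforms."""
    trivial = {"cast", "dedupe", "rename"}
    return not pipeline["has_business_logic"] and set(pipeline["transforms"]) <= trivial

candidates = [p["name"] for p in PIPELINES if is_simple_replication(p)]
share = len(candidates) / len(PIPELINES)

print(f"simple replication candidates: {candidates}")
print(f"share of portfolio: {share:.0%} (worth evaluating zero-ETL above roughly 60%)")
```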