
Beyond ETL: the modern data engineering landscape

Tran Hieu Dang

Mar 28, 2026

In the world of data engineering, the "pipeline" is the unsung hero. It is the plumbing that ensures raw, messy data from a dozen different sources transforms into the polished, actionable insights that drive business decisions.

Historically, we’ve relied on a three-step dance: Extract, Transform, and Load (ETL). But as cloud computing evolved and data volumes exploded, new architectures like ELT and Reverse ETL emerged. If you’re building a modern data stack, understanding these nuances is critical. In this post, we will look at the popular data architectures and how a modern solution like the data lakehouse can help solve many of these problems.

The architectures

The classic: ETL (Extract, Transform, Load)

ETL is the "traditional" approach. In this workflow, data is extracted from source systems, moved to a temporary staging area where it is cleaned and formatted, and then loaded into a target data warehouse.

  • When to use it: ETL is ideal when you have sensitive data that needs masking before it hits the warehouse, or when you are working with legacy on-premise systems with limited processing power.
  • The Pro: The data in your warehouse is "clean" and ready for analysis immediately upon arrival.
  • The Con: It’s often slower. If you realize you needed a piece of data you filtered out during the "Transform" stage, you have to rerun the entire pipeline.
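The three steps can be sketched in plain Python. This is a toy illustration, not a production pipeline: the source rows, table schema, and cleaning rules are all hypothetical, and SQLite stands in for the warehouse.

```python
import sqlite3

def extract():
    # Raw records from a hypothetical source system; in practice this
    # would be an API call or a database query.
    return [
        {"order_id": 1, "amount": "19.99", "country": "us"},
        {"order_id": 2, "amount": "bad-value", "country": "DE"},
    ]

def transform(rows):
    # Clean and format in a staging step BEFORE loading:
    # drop unparsable rows, cast types, normalize values.
    cleaned = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # dropped here; recovering it later means rerunning the pipeline
        cleaned.append((row["order_id"], amount, row["country"].upper()))
    return cleaned

def load(rows, conn):
    conn.execute("CREATE TABLE IF NOT EXISTS orders (order_id INT, amount REAL, country TEXT)")
    conn.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)

conn = sqlite3.connect(":memory:")
load(transform(extract()), conn)
print(conn.execute("SELECT * FROM orders").fetchall())  # only the clean row arrives
```

Note how the "bad-value" row is gone for good once the pipeline has run: that is exactly the rigidity described in the Con above.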

The cloud favorite: ELT (Extract, Load, Transform)

With the rise of powerful cloud data warehouses, the paradigm shifted. Why transform data on a separate mid-stream server when the warehouse itself has massive, scalable compute power?

In ELT, you extract the raw data and load it directly into the warehouse. Only after it’s safely stored do you use SQL-based tools (like dbt) to transform it into various models.

  • The Pro: Speed and flexibility. You can store "raw" data forever and decide how to transform it later. It’s the backbone of the modern data stack.
  • The Con: It can lead to "data swamps" if you aren’t disciplined about documenting and managing the raw files you’ve dumped into the warehouse.
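For contrast, here is the same idea ELT-style, again using SQLite as a stand-in warehouse with a toy dataset and illustrative table names. The raw rows are loaded untouched, and the transformation happens afterwards inside the warehouse as a SQL statement, which is roughly what a dbt model compiles to.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Extract + Load: raw data lands in the warehouse untouched.
conn.execute("CREATE TABLE raw_orders (order_id INT, amount TEXT, country TEXT)")
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [(1, "19.99", "us"), (2, "bad-value", "DE")],
)

# Transform: done later, inside the warehouse, in SQL.
# (SQLite casts non-numeric strings to 0.0, so the filter drops them.)
conn.execute("""
    CREATE TABLE stg_orders AS
    SELECT order_id,
           CAST(amount AS REAL) AS amount,
           UPPER(country) AS country
    FROM raw_orders
    WHERE CAST(amount AS REAL) > 0
""")
print(conn.execute("SELECT * FROM stg_orders").fetchall())
```

The key difference from ETL: `raw_orders` still holds the rejected row, so if the filtering rule turns out to be wrong, you rewrite one SQL model instead of re-extracting from the source.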

The new contender: EtLT (Extract, transform, Load, Transform)

Wait, another one? EtLT (lowercase 't') is a hybrid approach often used for compliance. You perform a "light" transformation first, such as hashing PII (Personally Identifiable Information) or filtering out private rows, then load the data, and finally perform the heavy business-logic transformations within the warehouse.
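The lowercase "t" might look like this: a minimal in-flight masking step, sketched with Python's standard hashlib. The record shape is hypothetical; the point is that the hash is applied before the data is loaded anywhere.

```python
import hashlib

def light_transform(row):
    # The lowercase "t": irreversibly mask PII in flight,
    # before the record ever reaches storage.
    row = dict(row)  # copy, so the source record is untouched
    row["email"] = hashlib.sha256(row["email"].encode()).hexdigest()
    return row

record = {"user_id": 7, "email": "alice@example.com", "spend": 120}
masked = light_transform(record)
print(masked)  # email is now a hash; heavy business logic runs later, in the warehouse
```

Hashing (rather than deleting) the field keeps it usable as a join key, so downstream models can still count distinct users without ever seeing an address.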

The unified frontier: the lakehouse (Databricks)

Perhaps the most significant shift in the last few years is the rise of the data lakehouse, pioneered by Databricks.

Historically, companies had to maintain two separate systems: a data lake (cheap storage for raw/unstructured data) and a data warehouse (structured data for BI). Databricks bridged this gap with Delta Lake, an open-source storage layer that brings ACID transactions and reliability to data lakes.

The medallion architecture: Databricks popularized a specific flow for ELT/ETL called the medallion architecture, in which data moves through three layers:

  • Bronze (raw): The landing zone for raw data (the "Load" in ELT).
  • Silver (filtered/cleansed): Data is joined and cleaned.
  • Gold (business ready): Highly aggregated data ready for ML models or executive dashboards.

Why it's different: unlike traditional warehouses, Databricks excels at data science and AI. Because it's built on Apache Spark, you can run complex Python-based machine learning models on the same data that powers your SQL dashboards.

More on data lakehouse

The Data Lakehouse — a term largely coined by Databricks — seeks to end the "two-tier" era where companies had to maintain a messy data lake for data science and a rigid data warehouse for business intelligence.

By merging the cheap storage of a lake with the performance and structure of a warehouse, it promises the best of both worlds.

To understand how the data lakehouse solves the specific failures of traditional architectures, we have to look at the "pain points" of the old two-tier system (lake + warehouse) and how Databricks-specific features act as the cure.

The reliability problem: "the data swamp"

The problem

Traditional Data Lakes are just folders of files (Parquet, CSV, JSON). If a write operation crashes halfway, you end up with "garbage" data. There is no way to "undo" a mistake, and multiple people can't write to the same file safely.

The Databricks solution: Delta Lake

Delta Lake adds a transaction log (the _delta_log folder) to your cloud storage. This brings ACID transactions to the lake.

Imagine a streaming pipeline adding sales data every minute. In a traditional data lake system, if the stream flickers, you might get a partial file that breaks every downstream report. With Databricks, Delta Lake ensures the write is "all or nothing." If it fails, the transaction log simply doesn't record the new version, and readers continue to see the last "good" state. Plus, you can use time travel (e.g., VERSION AS OF 10) to roll back the entire table to yesterday's state if a bug introduced bad data.
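The mechanism can be sketched as a toy versioned table in plain Python. This is not the Delta Lake implementation, just an illustration of the idea: a committed write appends a new snapshot to the log, a failed write leaves the log untouched, and "time travel" is simply reading an older entry.

```python
class VersionedTable:
    """Toy sketch of Delta-style versioning via an append-only log."""

    def __init__(self):
        self._versions = [[]]  # version 0: empty table

    def write(self, rows, fail=False):
        if fail:
            # Simulated mid-write crash: no new version is recorded,
            # so readers never see a partial result.
            raise IOError("stream flickered")
        self._versions.append(self._versions[-1] + rows)

    def read(self, version=None):
        # "Time travel": read any past version (VERSION AS OF n in Delta SQL).
        return self._versions[-1 if version is None else version]

table = VersionedTable()
table.write([{"sale": 100}])
try:
    table.write([{"sale": -1}], fail=True)
except IOError:
    pass

print(table.read())           # last good state, unaffected by the failed write
print(table.read(version=0))  # roll back to the original empty table
```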

The performance problem

The problem

Data warehouses are fast because they use proprietary, highly optimized storage. Data lakes are slow because they have to read "dumb" files from cloud storage, which involves high latency and "small files" overhead.

The Databricks solution: Photon engine & predictive I/O

Databricks developed Photon, a vectorized query engine written in C++ (not Java/Spark). It sits on top of the lake and executes SQL at speeds that rival or beat traditional warehouses.

Suppose a marketing team wants to query five years of customer behavior data (billions of rows) to find high-value users. In a traditional data lake, a standard Spark query might take 20 minutes because it has to scan every single file in the directory. In Databricks, Photon uses data skipping and Z-ordering (clustering related data together). It knows exactly which files don't contain the data and skips them entirely, returning the result in seconds.
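Data skipping is easy to demonstrate in miniature. Delta keeps per-file min/max column statistics in its transaction log; the snippet below fakes those statistics for three hypothetical files and shows how a filter can rule files out without opening them.

```python
# Invented per-file min/max statistics, standing in for the ones
# Delta Lake records in its transaction log.
files = [
    {"path": "part-0", "min_spend": 0,   "max_spend": 90},
    {"path": "part-1", "min_spend": 100, "max_spend": 950},
    {"path": "part-2", "min_spend": 5,   "max_spend": 40},
]

def files_to_scan(threshold):
    # For a query like "spend > threshold", skip any file whose
    # max value can't possibly satisfy the predicate.
    return [f["path"] for f in files if f["max_spend"] > threshold]

print(files_to_scan(95))  # only one of the three files needs to be read
```

Z-ordering makes this pruning effective in practice: by physically clustering related rows, it keeps each file's min/max range narrow, so more files can be skipped.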

The governance problem: "two security models"

The problem

In the old way, you had to manage security in two places: IAM roles for the data lake (files) and SQL permissions for the warehouse (tables). This leads to "security drift" where someone has access to the "gold" table but can also accidentally see the raw "bronze" PII data in the lake.

The Databricks solution: Unity Catalog

Unity Catalog is a unified governance layer. It treats everything (files, tables, ML models, and even dashboards) as a single searchable catalog.

Let's say we have a big dataset which contains VINs and GPS coordinates. Our data scientists need the data to improve fuel efficiency algorithms, but due to GDPR, they should not be able to see the exact location of individual drivers. In a traditional data warehouse, we'd normally have to create a "masked" copy of the dataset, doubling storage costs and creating a maintenance nightmare.

With Unity Catalog, we can use the same "gold" table with dynamic data masking: when engineers query it, they see the full GPS coordinates, while a data scientist only sees coordinates coarsened to roughly city level, all in real time.
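The masking logic itself is simple to sketch. The function below is a plain-Python stand-in for what a Unity Catalog masking policy would do at query time; the roles, row shape, and VIN are invented, and rounding to one decimal place (roughly an 11 km grid) is a crude stand-in for "nearest city."

```python
def read_location(row, role):
    # One gold table, two views: full precision for engineers,
    # coarse coordinates for everyone else.
    lat, lon = row["lat"], row["lon"]
    if role == "engineer":
        return lat, lon
    return round(lat, 1), round(lon, 1)  # ~11 km grid, a rough "city-level" proxy

row = {"vin": "WVW0000000000001", "lat": 48.13743, "lon": 11.57549}
print(read_location(row, "engineer"))        # exact position
print(read_location(row, "data_scientist"))  # coarsened position
```

The crucial point is that there is still only one stored copy of the data; the mask is applied on read, per caller, not baked into a duplicate table.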

The complexity problem: "brittle ETL"

The problem

Building pipelines (ETL) is usually a mess of custom Python scripts, Airflow DAGs, and manual "glue" code. If a source schema changes (e.g., a new column is added), the whole pipeline breaks.

The Databricks solution: Delta Live Tables (DLT)

DLT is a "declarative" framework. You don't tell the system how to move the data; you define what the result should look like, and Databricks manages the rest.

Suppose we receive sensor data with occasional null values that crash our dashboards. In a traditional data lake system, we'd write complex "if-then" logic to catch errors and manually restart the pipeline when it fails. In DLT, we define an expectation (e.g., CONSTRAINT valid_temp EXPECT (temperature > -50)). If a record violates this, DLT can automatically drop it, quarantine it, or alert us — all while the pipeline keeps running smoothly.
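The behavior of that expectation can be mimicked in a few lines of plain Python. This is not the DLT API, just the declarative idea: the predicate is stated once, and the framework (here, a tiny loop) decides per record whether to keep or quarantine, without ever halting the run.

```python
def valid_temp(record):
    # Stand-in for: CONSTRAINT valid_temp EXPECT (temperature > -50)
    return record["temperature"] is not None and record["temperature"] > -50

def run_pipeline(records):
    kept, quarantined = [], []
    for r in records:
        (kept if valid_temp(r) else quarantined).append(r)
    return kept, quarantined  # the pipeline finishes either way

records = [{"temperature": 21.0}, {"temperature": None}, {"temperature": -273.0}]
kept, quarantined = run_pipeline(records)
print(len(kept), len(quarantined))  # 1 2
```

Keeping the quarantined records around (instead of silently discarding them) is what makes the alerting and later inspection possible.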