
Integrating CI/CD Practices into Data Engineering

  • Adrian Brudaru,
    Co-Founder & CDO

Josh Wills, ex-Google, Cloudera, and Slack, has spent years fixing broken data pipelines. Known for his no-BS takes on data reliability, he cuts straight to the point: data engineering is still YOLO, and it's costing us big.

This article distills key takeaways from his talk at a dltHub event in 2024 and underscores why we built dlt+ Cache: to bring structured, repeatable testing to data engineering.

The Problem: Data Engineers Are Stuck in a YOLO Workflow

Software engineering wasn’t always this structured. Decades ago, developers deployed code straight to production with little testing, just like many data engineers do today. Over time, software teams built CI/CD pipelines to automate testing, catch errors early, and iterate quickly. But data engineers? Many still push transformations straight to prod and hope nothing breaks, because automated testing infrastructure barely exists.

“Data quality problems are things like the schema was changed and no one told us, so everything broke.” - Josh

Some argue that data engineering is fundamentally different from software engineering and doesn’t require the same CI/CD rigor. Wrong. Data pipelines operate in dynamic, ever-changing environments where upstream dependencies (APIs, databases, business logic) can shift without warning. If anything, data engineers face more unpredictability than software developers, yet they lack the equivalent safeguards.

The place of throw-away code

Not all data code needs to be perfect. In fact, iteration speed is often more important than rigor in data science and analytics. But when those throwaway experiments become data engineering production workloads, things break. That’s when YOLO needs to end.

The Real Cost of Data Engineering YOLO

Running transformations directly in production has serious downsides:

  • Wasted compute: Debugging inside a data warehouse is an expensive luxury. Companies like Airbnb and Netflix have documented their struggles with skyrocketing warehouse costs due to inefficient testing workflows.
  • Slow iteration: Even minor changes require full-scale runs, leading to long feedback loops. Engineers end up hesitant to experiment, stalling innovation.
  • Pipeline lockups: A single broken transformation can cause cascading failures. This has real consequences: financial reports, AI models, and customer dashboards depend on reliable data.
“The cost of fixing defects goes up by several orders of magnitude as we move from the coding phase to unit testing, to functional testing, to system testing, to release.” - Josh

What If Data Engineering Had CI/CD?

CI/CD (Continuous Integration and Continuous Deployment) is what keeps modern software engineering moving fast without breaking things. It ensures that every change is tested, integrated, and deployed in a structured, automated way. But what does that actually mean?

At its core, CI/CD is about eliminating manual, error-prone steps from the development cycle. Here’s how it works:

  1. Developers commit code to a shared repository: no "solo YOLO" changes.
  2. Automated tests run to catch issues before they reach production.
  3. Code is merged into the main branch only if it passes all tests.
  4. Deployments happen automatically, ensuring stable and continuous delivery.
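Applied to a data transformation, the gate in steps 2 and 3 can be sketched in a few lines of plain Python. Everything here (`transform`, `run_unit_tests`, the sample data) is invented for illustration; it is not a real dlt API.

```python
# Illustrative CI gate for a data transformation:
# step 2 runs the automated tests, step 3 merges only on green.

def transform(rows):
    """The transformation under review: derive a full_name column."""
    return [{**r, "full_name": f"{r['first']} {r['last']}"} for r in rows]

def run_unit_tests():
    """Tiny test suite the CI server would execute on every commit."""
    sample = [{"first": "Ada", "last": "Lovelace"}]
    out = transform(sample)
    assert out[0]["full_name"] == "Ada Lovelace"
    assert set(out[0]) == {"first", "last", "full_name"}  # output schema is stable
    return True

def can_merge():
    """The gate: code reaches main only if every test passes."""
    try:
        return run_unit_tests()
    except AssertionError:
        return False

print("merge allowed:", can_merge())  # merge allowed: True
```

In a real setup the same assertions would live in a pytest suite triggered by the CI server on every commit, rather than in an inline function.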

But why does this matter? Without CI/CD, every code deployment is a potential disaster waiting to happen. It’s slow, painful, and packed with uncertainty. Yet, in data engineering, most teams still live in this world.

"Imagine not testing in prod?????" – Pedram Navid, Chief Dashboard Officer at Dagster

Now, think about what would change if data engineers had CI/CD:

✅ Transformations tested locally before deployment.
✅ Schema changes validated instantly, preventing breaking changes.
✅ Integration tests run on every PR, ensuring smooth operation before anything hits production.
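For example, the "schema was changed and no one told us" failure Josh describes can be caught on every PR with a check as simple as the following sketch. The contract and column names are invented for illustration; this is not dlt+ Cache's actual API.

```python
# Illustrative schema guard: fail the PR if upstream columns drift
# from the contract the downstream models were built against.

EXPECTED_COLUMNS = {"order_id": int, "amount": float, "created_at": str}

def validate_schema(rows, expected=EXPECTED_COLUMNS):
    """Return a list of violations; an empty list means the PR may proceed."""
    violations = []
    for i, row in enumerate(rows):
        missing = expected.keys() - row.keys()
        if missing:
            violations.append(f"row {i}: missing columns {sorted(missing)}")
        for col, typ in expected.items():
            if col in row and not isinstance(row[col], typ):
                violations.append(
                    f"row {i}: {col} is {type(row[col]).__name__}, expected {typ.__name__}"
                )
    return violations

good = [{"order_id": 1, "amount": 9.99, "created_at": "2024-01-01"}]
drifted = [{"order_id": "1", "amount": 9.99}]  # upstream renamed and retyped fields

print(validate_schema(good))     # [] -> safe to merge
print(validate_schema(drifted))  # two violations -> block the merge
```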

“What if I could run my entire data pipeline on every PR? Every single upstream system, every single change.” - Josh

This isn’t just a nice idea, it’s exactly what dlt+ Cache enables. We finally have the tech and talent to bring real CI/CD to data engineering, and it’s about time.

How Data Contracts and CI/CD Work Together

You have probably heard of Write-Audit-Publish (WAP) and data contracts, and wondered how they relate to CI/CD and testing.

In practice, CI/CD for data runs on a WAP pattern and involves two types of tests: code tests and data tests.

Execution patterns:

- CI/CD triggers code tests when new code is committed.
- WAP is an execution pattern: write data to a staging buffer, run tests on it, and only then publish it.
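The WAP pattern above can be sketched in a few lines, with dictionaries standing in for the staging and production stores (all names are invented for illustration):

```python
# Minimal Write-Audit-Publish sketch. Data lands in an isolated staging
# buffer, is audited there, and is only promoted to prod if checks pass.

staging, production = {}, {}

def write(batch_id, rows):
    """Write: load new data into staging, never directly into prod."""
    staging[batch_id] = rows

def audit(batch_id):
    """Audit: run data-quality checks against the staged batch."""
    rows = staging[batch_id]
    return all("amount" in r and r["amount"] >= 0 for r in rows)

def publish(batch_id):
    """Publish: promote the batch only if the audit passed."""
    if not audit(batch_id):
        raise ValueError(f"batch {batch_id} failed audit; prod left untouched")
    production[batch_id] = staging.pop(batch_id)

write("batch-001", [{"amount": 10.0}, {"amount": 2.5}])
publish("batch-001")
print(sorted(production))  # the audited batch is now live
```

The key property is that a failed audit raises before the `production` store is touched, so downstream consumers never see the bad batch.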

Checks:

- Data contracts check data schema and quality and, on failure, alert the data owner and downstream users.

- Code tests check if the changed code is valid (unit tests), and if it is compatible with upstream and downstream dependencies (integration tests).
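The data contract's failure path (alert the owner on a breach rather than silently shipping bad data) can be sketched like this; the owner address, contract shape, and notification list are all invented for illustration:

```python
# Illustrative data contract: on violation, notify the data owner
# instead of letting the breach pass unnoticed.

CONTRACT = {
    "table": "orders",
    "owner": "analytics-team@example.com",
    "required_columns": {"order_id", "amount"},
}

alerts = []  # stand-in for an email or Slack notification channel

def notify(recipient, message):
    alerts.append((recipient, message))

def enforce_contract(rows, contract=CONTRACT):
    """Check each row against the contract; alert the owner on any breach."""
    breaches = 0
    for i, row in enumerate(rows):
        missing = contract["required_columns"] - row.keys()
        if missing:
            breaches += 1
            notify(contract["owner"],
                   f"{contract['table']}: row {i} missing {sorted(missing)}")
    return breaches == 0

ok = enforce_contract([{"order_id": 1}])  # 'amount' was dropped upstream
print(ok, alerts)
```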

Testing code and data: More in common than different


To run code tests, we need data (and a runtime) to run the code against, which means we will generally use something like a WAP pattern for testing code, too.

A New Standard for Data Engineering

Data engineering doesn’t have to be YOLO anymore. We finally have the tools to shift left, moving testing earlier in the development cycle, just like software engineers do.

“Data engineers are in thrall to the monolithic data warehouse. We need to break free and solve problems wherever they occur.” - Josh

Try dlt+ Cache Today

It’s time to stop debugging in production. Test transformations like a software engineer. Start using dlt+ Cache.

dlt+ Cache is in early access.