Shift YOURSELF Left
- Adrian Brudaru, Co-Founder & CDO
In his talk, Josh tackles the question "who should care about data quality?". For the past few years, the industry has been telling us we need data contracts and shift left. But is that what we really need to solve data quality? Let's define the two terms for the context of the talk:
A data contract is a static, API-spec-like agreement between data producers and consumers, but it is not a complete data test. True testing must verify business logic and schema integrity in actual usage.
Shift left means detecting and fixing problems earlier in the lifecycle (e.g., during coding rather than in production). In theory it sounds good, but "left" is an actual team, not a concept, and do you think they have time for your extra requirements?
When are data contracts a good answer?
Data often comes from external sources. For example, we might have a public webhook for capturing web events. For such open sources, the data is usually produced by the application, but a bot or a malicious user could also send random data with payloads totally different from what you are used to.
In these cases, the right thing to do is to park any suspicious data and investigate further if the volume of such events is high. A simple data contract on the data engineering side is the best we can do, given that we do not control the source.
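A minimal sketch of what that engineer-side contract can look like, assuming pydantic v2 and a made-up payload shape: anything that fits the expected schema flows through, everything else gets parked for investigation.

```python
from pydantic import BaseModel, ValidationError

# Expected webhook payload; the field names here are illustrative, not from the talk.
class WebEvent(BaseModel):
    event_id: str
    user_id: str
    event_type: str
    timestamp: str

def split_events(raw_payloads: list[dict]) -> tuple[list[dict], list[dict]]:
    """Accept payloads that match the expected shape; park everything else."""
    accepted, parked = [], []
    for payload in raw_payloads:
        try:
            accepted.append(WebEvent(**payload).model_dump())
        except ValidationError:
            parked.append(payload)  # quarantine for later investigation
    return accepted, parked
```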
When are data contracts a mere bandage for a wooden leg?
What data contracts usually try to solve is "data quality" coming from internal sources. Here, they are insufficient: what we really want is end-to-end testing, and data contracts validate almost nothing about the data other than its shape.
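To make that concrete, here is a made-up record that sails through a shape-only contract and is still wrong; only a business-logic check notices.

```python
# Passes a shape-only contract: every field exists and has the right type.
order = {"order_id": "A-1", "quantity": 3, "unit_price": -10.0, "total": 5.0}

# Business-logic checks - the kind a shape-only contract never runs.
violations = []
if order["unit_price"] < 0:
    violations.append("price cannot be negative")
if order["total"] != order["quantity"] * order["unit_price"]:
    violations.append("total does not match quantity * unit_price")

print(violations)  # both rules are violated, yet the "contract" was satisfied
```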
Why shift left doesn't work
Different engineering paradigm, lack of resources, lack of skills or tech
"Shift left" doesn’t work in data engineering for a simple reason: it’s not built for the realities we deal with. Unlike software, where you can write fast unit tests or mock systems locally, data pipelines depend on massive systems, APIs, and constantly evolving business logic.
Running full tests early in development, at production scale, is impractical and expensive. The tools and infrastructure most teams use simply aren’t designed for it, and trying to make it work often creates more complexity than it solves.
Org challenges
The bigger issue is organizational. Upstream engineering teams often don’t prioritize data quality. They ship changes without worrying about downstream impact, leaving data engineers to clean up the mess. Silos like this make collaboration and early testing nearly impossible.
Red pill: Solve your own problem
The cavalry isn’t coming, and the budget for more heads is a pipe dream. If you start banging on about shifting responsibilities to other teams, don’t be surprised when they dig in their heels or conveniently “run out of bandwidth.”
No one’s going to swoop in to save your pipelines or fix your data mess. It’s on you. So tighten that toolbelt, grab your metaphorical wrench, and get to work: own the problem, solve it.
Collaboration instead of pushing
Instead of forcing "shift left" as an unrealistic mandate, we focus on practical solutions. Collaborate directly with upstream teams. Use lightweight tools like DuckDB to run real integration tests locally without depending on expensive warehouses. Containerize your pipelines so you can test them consistently across environments. The goal isn’t to push everything earlier, it’s to make testing faster, simpler, and fit for purpose, without burning out your team in the process.
The proposed solution
Watch Josh's video for details about his proposed solution:
Detailed solution summary (generated with an LLM):
Josh's approach centers on shifting testing responsibilities earlier in the lifecycle but in a practical, engineer-friendly way. Instead of adding more work to already burdened teams, he proposes:
1. Containerize Your Data Pipelines
What it means:
Encapsulate your entire data pipeline—including its tools, dependencies, and configurations—into a Docker container. This makes the pipeline portable, consistent, and easily integrated with other systems.
How to do it:
- Include all your tools (e.g., dbt, SQLMesh, dlt) and their dependencies in a single container.
- Use Docker Compose if your pipeline relies on multiple services like databases, orchestrators, or storage layers.
- Ensure the container can run your pipeline in a development or testing environment, mirroring production.
Why it’s important:
This decouples your pipeline from specific environments, ensuring consistency across development, testing, and production. It also allows the pipeline to run alongside upstream services in integration tests.
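A Dockerfile and compose file are the natural artifacts here, but the gist at the code level is a single entrypoint that reads all environment-specific settings from outside the image. A minimal sketch, assuming dlt and an environment variable name invented for this example:

```python
# entrypoint.py - the one command the container image runs.
# PIPELINE_DESTINATION is an assumed variable name, not something from the talk.
import os
import dlt

def main() -> None:
    # Same code in dev, CI, and production; only the injected configuration changes.
    destination = os.environ.get("PIPELINE_DESTINATION", "duckdb")
    pipeline = dlt.pipeline(
        pipeline_name="web_events",
        destination=destination,
        dataset_name="raw_events",
    )
    # A real pipeline would pull from the upstream system or a staging copy.
    sample = [{"event_id": "1", "event_type": "page_view"}]
    print(pipeline.run(sample, table_name="events"))

if __name__ == "__main__":
    main()
```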
2. Integrate Data Pipelines into CI/CD
What it means:
Tie your containerized pipeline into your team's CI/CD system to run tests automatically whenever changes are made. This includes integration tests between upstream systems and downstream data pipelines.
How to do it:
- Identify your team’s CI/CD tool (e.g., GitHub Actions, Jenkins, GitLab CI).
- Configure the CI/CD pipeline to:
- Spin up your data pipeline container.
- Fetch test data from staging environments or mock upstream systems.
- Run integration tests, verifying schema consistency, business logic, and data flow.
- Automate these tests for every pull request to catch issues early.
Why it’s important:
This brings integration testing into the development process, reducing the cost and impact of bugs by catching them before deployment.
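As a sketch of what such a CI job could execute, here is a pytest-style integration test that runs the pipeline against a throwaway DuckDB file and checks both schema and one business rule; the module, table, and column names are assumptions for illustration.

```python
# test_integration.py - run by CI (GitHub Actions, Jenkins, GitLab CI) on every PR.
import duckdb

def test_events_pipeline(tmp_path):
    db_file = str(tmp_path / "test.duckdb")

    # Run the pipeline against a disposable DuckDB file; in practice this could
    # also invoke the container from step 1 via docker compose.
    from my_pipeline import run_pipeline  # hypothetical module in your repo
    run_pipeline(destination_path=db_file)

    con = duckdb.connect(db_file)

    # Schema check: the columns downstream models depend on are still there.
    cols = {row[0] for row in con.execute("DESCRIBE raw_events.events").fetchall()}
    assert {"event_id", "event_type"} <= cols

    # Business-logic check: no duplicate event ids slipped through.
    dupes = con.execute(
        "SELECT count(*) FROM (SELECT event_id FROM raw_events.events "
        "GROUP BY event_id HAVING count(*) > 1)"
    ).fetchone()[0]
    assert dupes == 0
```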
3. Use Lightweight Tools Like DuckDB
What it means:
Replace the dependency on heavy, monolithic systems like data warehouses with lightweight, embeddable databases like DuckDB for local testing.
How to do it:
- Set up DuckDB as a stand-in for your data warehouse during testing.
- Load sample or staging data into DuckDB to simulate the pipeline’s behavior in production.
- Run your entire pipeline (extract, transform, and load) locally and verify results.
Why it’s important:
DuckDB enables fast, resource-efficient testing without incurring the high costs of running queries on production-scale warehouses.
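A self-contained sketch of the idea: an in-memory DuckDB stands in for the warehouse, a few hand-made rows stand in for the raw table, and the same transformation logic gets verified locally (table and column names are made up).

```python
import duckdb

# In-memory DuckDB as a stand-in for the warehouse during local testing.
con = duckdb.connect()

# A tiny sample mimicking the raw table the pipeline produces.
con.execute("""
    CREATE TABLE raw_orders AS
    SELECT * FROM (VALUES
        (1, 'completed', 120.0),
        (2, 'completed',  80.0),
        (3, 'cancelled',  50.0)
    ) AS t(order_id, status, amount)
""")

# Run the same transformation logic that production would run.
revenue = con.execute(
    "SELECT sum(amount) FROM raw_orders WHERE status = 'completed'"
).fetchone()[0]

assert revenue == 200.0  # business logic verified locally, in milliseconds
```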
4. Simplify Data Ingestion with Tools Like dlt
What it means:
Handle data ingestion using flexible, programmable tools like dlt, which integrates seamlessly with your Python-based pipeline and CI/CD workflows.
How to do it:
- Use dlt to define ingestion pipelines that pull data from upstream systems into your testing environment.
- Configure dlt to handle both small-scale and large-scale ingestion, depending on your testing needs.
- Embed dlt directly in your CI/CD pipeline for automated data loading during integration tests.
Why it’s important:
dlt simplifies the ingestion process, making it easier to test how upstream data changes affect your downstream pipelines.
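A short dlt sketch along these lines; the staging URL and payload layout are assumptions, not part of Josh's setup.

```python
import dlt
import requests

# Hypothetical staging endpoint; point this at your upstream system's test API.
STAGING_URL = "https://staging.example.com/api/events"

@dlt.resource(name="events", write_disposition="append")
def staging_events():
    # Pull a small batch of events from staging for the integration test.
    response = requests.get(STAGING_URL, timeout=30)
    response.raise_for_status()
    yield from response.json()

if __name__ == "__main__":
    # Load into DuckDB locally; swap the destination for CI or production runs.
    pipeline = dlt.pipeline(
        pipeline_name="staging_ingest",
        destination="duckdb",
        dataset_name="staging_events",
    )
    print(pipeline.run(staging_events()))
```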
5. Break Free from Monolithic Thinking
What it means:
Stop defining your workflows solely around the constraints of your monolithic data warehouse. Instead, adopt a system-agnostic mindset where problems are solved at the source.
How to do it:
- Collaborate with upstream engineering teams to align on data schema and business logic.
- Use data contracts (API-like agreements) to formalize expectations between data producers and consumers.
- Build lightweight integration tests that verify schema changes and logic across all systems.
Why it’s important:
By addressing issues where they originate—whether in the frontend, backend, or data pipeline—you reduce the downstream impact of errors.
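If you ingest with dlt, one concrete way to formalize such expectations is its schema contract setting (see the data contract docs linked at the end); the modes chosen below are illustrative rather than a recommendation.

```python
import dlt

@dlt.resource(
    name="orders",
    schema_contract={
        "tables": "evolve",          # new tables may appear
        "columns": "discard_value",  # values for unexpected columns are dropped
        "data_type": "freeze",       # a type change fails the load instead of corrupting data
    },
)
def orders(rows):
    yield from rows

pipeline = dlt.pipeline(pipeline_name="contracted_load", destination="duckdb")
pipeline.run(orders([{"order_id": 1, "amount": 19.99}]))
```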
6. Learn and Collaborate
What it means:
Adopt a growth mindset by learning about adjacent disciplines (e.g., backend engineering) and collaborating with your peers to improve cross-team workflows.
How to do it:
- Participate in communities like dbt Slack, DuckDB Discord, and dlt forums.
- Use AI tools like OpenAI's ChatGPT or Anthropic's Claude to quickly learn new concepts or troubleshoot issues.
- Share your knowledge and solutions with your team to create a culture of continuous improvement.
Why it’s important:
Cross-discipline understanding and collaboration foster stronger workflows, better tooling, and higher data quality across the organization.
Final Integration: End-to-End Testing with GitHub Repo Example
Josh provides a GitHub repo as a practical example. It includes:
- A sample application to demonstrate pipeline integration.
- A dbt + DuckDB + dlt pipeline for data processing.
- Docker Compose configurations for integration testing.
- Hooks to run integration tests automatically from upstream to downstream systems.
The setup is designed to be simple, avoiding unnecessary complexity, and serves as a starting point for teams to build their own solutions.
Outcome:
By following this approach, data engineers can:
- Catch issues earlier in the development process.
- Reduce dependence on monolithic systems for testing.
- Build more reliable, scalable, and testable pipelines.
- Collaborate effectively with upstream teams to improve overall system quality.
dltHub's call to action
Read more:
- Josh's Github repo
- Data contract docs
- Example of unit testing with Dagster and dlt (it helps to an extent, but you really want integration testing)
Or accelerate your dlt work with the help of the dlt solutions team.