
Data contract agreement vs enforcement

  • Adrian Brudaru,
    Co-Founder & CDO

In this article, I want to briefly clarify the distinction between the two parts of a data contract: agreement and enforcement.

What is a data contract?

A data contract is an architectural mechanism designed to ensure reliable interfaces between decoupled systems. It’s made of two parts:

  • Agreement: an understanding between the data producer and the data consumer team.
    • What rules must the data follow?
    • What should happen when a rule is broken?
  • Enforcement: a technical mechanism that
    • tests the data against the rules
    • rejects non-conforming data.

A simple example of a contract

In 2018 I was working at a gym aggregator company which used a production system for managing core operations (subscription types, gym check-ins etc.), and a CRM for handling sales operations. A typical setup for many companies.

The sales agents would create a CRM company for each new signup on the production system, or assign the signup to an existing CRM company. If they did not, that signup’s purchases were not attributed to any sales agent. Because a “company” is rarely a single entity (it often has multiple fiscal entities or contacts) and matching them needs human identity resolution, the agents did this process manually.

So the contract was as follows: every morning, a SQL test would check for production companies that did not have a CRM company assigned. If an offending record was found, the sales team was alerted. The sales team could then fix the data and re-trigger the calculations, which would then attribute the purchases to them in the reporting layer.
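As a rough sketch of that morning check, the snippet below uses Python's built-in sqlite3 as a stand-in for the real databases. The table names, column names, and sample rows are all hypothetical, and the alert is only formatted, not sent anywhere.

```python
import sqlite3

# Hypothetical tables standing in for the production system and the CRM.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE crm_companies (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE production_companies (
        id INTEGER PRIMARY KEY,
        name TEXT,
        crm_company_id INTEGER REFERENCES crm_companies(id)
    );
    INSERT INTO crm_companies VALUES (1, 'Acme Holding');
    INSERT INTO production_companies VALUES
        (10, 'Acme Gym Berlin', 1),
        (11, 'Acme Gym Munich', 1),
        (12, 'New Signup GmbH', NULL);
""")

def find_unassigned_companies(conn):
    """Return production companies that have no matching CRM company."""
    return conn.execute("""
        SELECT p.id, p.name
        FROM production_companies AS p
        LEFT JOIN crm_companies AS c ON c.id = p.crm_company_id
        WHERE c.id IS NULL
        ORDER BY p.id
    """).fetchall()

def format_alert(rows):
    """Build the message for the sales team; delivery is left out here."""
    lines = [f"- {name} (production id {pid})" for pid, name in rows]
    return "Companies without a CRM company:\n" + "\n".join(lines)

offenders = find_unassigned_companies(conn)
if offenders:
    print(format_alert(offenders))
```

The agreement lives in the query (what counts as a broken rule) and in the alert (what happens next); the daily schedule and the delivery channel are separate concerns.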

Do this with dlt: dataset access for SQL/Python, alert to Slack.

A complex example of contract and enforcement

Say you are collecting events from a user’s client, such as from their browser when they visit your website. Client-side event tracking means that the user’s client is sending you the data. Since this client is a browser, the user could in theory inspect the website code, execute some similar code, and send you custom trash instead of real usage events.

So who’s the producer here, and what can you actually act on?

You can ask the website development team to do better and check events before sending them from the browser, but if a user decides to circumvent the website code, they can.

So, the real data contract is with the app team: are they sending you buggy data, or perhaps data from their dev branches? This can be solved with an agreement expressed as a test, which the team can run to ensure they send clean data.

But for the malicious users? There’s no contract with them, just enforcement. While the dev team can validate things on their side to ensure they don’t send us trash, we should also validate the same schema contract on our side to ensure no malicious user has sent trash either. Combined with staging the data and running data quality (DQ) checks, this adds more safety and helps isolate any bad data.
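A minimal sketch of receiving-side enforcement, in plain Python rather than dlt: the same contract the app team tests against is re-checked on arrival, and anything that fails is set aside instead of loaded. The field names and the EVENT_CONTRACT structure are hypothetical.

```python
# Hypothetical event contract: field name -> required Python type.
EVENT_CONTRACT = {"event_name": str, "user_id": str, "timestamp": float}

def violations(event, contract=EVENT_CONTRACT):
    """List the ways an event breaks the contract (empty list means valid)."""
    if not isinstance(event, dict):
        return ["event is not an object"]
    problems = []
    for field, expected in contract.items():
        if field not in event:
            problems.append(f"missing field: {field}")
        elif not isinstance(event[field], expected):
            problems.append(f"wrong type for {field}")
    for field in event:
        if field not in contract:
            problems.append(f"unexpected field: {field}")
    return problems

def split_events(events):
    """Separate events into those safe to load and those to quarantine."""
    accepted, quarantined = [], []
    for event in events:
        problems = violations(event)
        if problems:
            quarantined.append((event, problems))
        else:
            accepted.append(event)
    return accepted, quarantined

incoming = [
    {"event_name": "page_view", "user_id": "u1", "timestamp": 1700000000.0},
    {"event_name": "page_view", "user_id": 42, "timestamp": 1700000001.0},
    {"event_name": "click", "user_id": "u2", "timestamp": 1700000002.0,
     "payload": "junk"},
]
accepted, quarantined = split_events(incoming)
```

Keeping the rejected events together with the reasons makes it possible to tell a buggy release (many events failing the same way) from a single odd client.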

Do this with dlt: Schema contracts, data quality checks (dltHub early access), stage data to local and test before loading.

LLM validation - pure enforcement

Let’s say you’re using an LLM to extract semi-structured data (JSON) from unstructured input (text, video). Your LLM might decide to do something random. In such cases, you’re not setting up a data contract, since there’s nobody to agree with; you are just creating technical enforcement between your two components.
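As a sketch of what that enforcement could look like, the function below parses a raw model reply and rejects anything that is not a JSON object with the expected fields. The field names are hypothetical, and what to do on rejection (retry, skip, log) is left to the caller.

```python
import json

# Hypothetical target structure for the extraction step.
REQUIRED_FIELDS = {"title": str, "speakers": list, "duration_seconds": int}

class InvalidLLMOutput(ValueError):
    """Raised when the model's reply does not match the expected structure."""

def parse_llm_output(raw):
    """Parse a raw model reply and enforce the expected fields and types."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise InvalidLLMOutput(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise InvalidLLMOutput("expected a JSON object")
    for field, expected in REQUIRED_FIELDS.items():
        if not isinstance(data.get(field), expected):
            raise InvalidLLMOutput(
                f"field {field!r} is missing or not a {expected.__name__}"
            )
    return data

good_reply = '{"title": "Intro", "speakers": ["A"], "duration_seconds": 90}'
record = parse_llm_output(good_reply)
```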

Boundaries, but softer

Sometimes, we want to let the source evolve within some constraints. For those cases, dlt’s schema contracts let you configure your enforcement to sit somewhere between strict enforcement and free schema evolution.

You can set a contract separately for tables, columns, and data types, choosing one of these modes for each:

  • evolve: No constraints, add everything new.
  • freeze: This will raise an exception if data is encountered that does not fit the existing schema, so no data will be loaded to the destination.
  • discard_row: This will discard any extracted row if it does not adhere to the existing schema, and this row will not be loaded to the destination.
  • discard_value: This will discard data in an extracted row that does not adhere to the existing schema, and the row will be loaded without this data.
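To make the difference between the four modes concrete, here is a toy model in plain Python that applies them to unexpected columns only. It is not dlt's implementation, and apply_column_contract and ContractViolation are hypothetical names; it only mirrors the behaviour described in the list above.

```python
class ContractViolation(Exception):
    """Raised in freeze mode when a row does not fit the known columns."""

def apply_column_contract(rows, known_columns, mode):
    """Apply one contract mode to rows; returns (rows_to_load, columns)."""
    columns = set(known_columns)
    loaded = []
    for row in rows:
        extra = set(row) - columns
        if not extra:
            loaded.append(dict(row))
        elif mode == "evolve":
            columns |= extra  # accept the new columns
            loaded.append(dict(row))
        elif mode == "freeze":
            # Raising here means nothing is returned, so nothing is loaded.
            raise ContractViolation(f"unexpected columns: {sorted(extra)}")
        elif mode == "discard_row":
            continue  # drop the whole row
        elif mode == "discard_value":
            # Keep the row, drop only the values that do not fit.
            loaded.append({k: v for k, v in row.items() if k in columns})
        else:
            raise ValueError(f"unknown mode: {mode}")
    return loaded, columns

rows = [
    {"id": 1, "name": "a"},
    {"id": 2, "name": "b", "plan": "gold"},  # "plan" is not in the schema
]
known = {"id", "name"}

evolved, evolved_columns = apply_column_contract(rows, known, "evolve")
kept_rows, _ = apply_column_contract(rows, known, "discard_row")
trimmed, _ = apply_column_contract(rows, known, "discard_value")
```

With these inputs, evolve loads both rows and adds the plan column, discard_row loads only the first row, discard_value loads both rows without plan, and freeze raises before anything is returned.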

Do this with dlt: Schema contracts

Stability is a Feature

We talk about data contracts as if they’re about building trust, but they’re really about surviving the lack of it. By enforcing strict schemas or tests at the door, you stop treating quality as a variable outcome of human behavior and start treating it as a fixed constraint of your code.

The most valuable contract isn't the one you negotiate in a meeting; it's the one you can enforce in your pipeline. Because eventually, entropy comes for every schema, and the only thing that stops it isn't a promise, it's a constraint.