Semantic data contracts
- Adrian Brudaru, Co-Founder & CDO
Introduction
If you're working in data engineering, you've probably felt the pain of maintaining data quality and governance, especially as organizations grow and data becomes more distributed. Traditional centralized data teams often can't keep up with the speed and complexity of modern data needs.
That's where data mesh comes in. Essentially microservices applied to data, this architectural approach decentralizes data ownership to domain-specific teams, allowing for greater scalability and agility. But with that decentralization come new challenges, particularly around data governance. And governance is where agility is usually lost again.
Enter semantic data contracts. They embed governance directly into your data pipelines, defining both the structure and the meaning of your data. You can manage and version them using tools like GitHub, fitting right into your existing workflows—whether centralized or decentralized.
In this article, I'll walk you through what data mesh is, the shortcomings of its traditional governance methods, and how semantic data contracts offer a better solution for modern data engineering.
The problem with traditional data governance
So, what's data mesh all about? It's an architectural approach that treats data as a product and assigns ownership of data domains to specific teams. Instead of having a central data team handle everything, each domain team (like marketing, sales, or operations) manages its own data pipelines, storage, and processing.
To ensure that the entire organization can still function cohesively, data mesh relies on data governance APIs. These are interfaces designed to enforce policies, standards, and best practices across all the different domains. The idea is to provide a way for various teams to communicate and adhere to company-wide data governance policies, even while working independently.
Shortcomings of using APIs
While data mesh and its governance APIs aim to solve the challenges of decentralized data management, they come with their own set of problems.
Wrong persona when translating from microservices to data
The issue stems from the fact that microservices are built by developers, who create APIs as part of their everyday work. Data and business people, on the other hand, do not build APIs day to day, so governance APIs become a complexity that can only be serviced by other roles. This introduces new bottlenecks. For technical governance to be manageable by data people, we need more approachable solutions.
Bureaucratic process for updates is not agile
These governance APIs are often rigid. Versioning them becomes a social problem instead of a collaborative workflow. Requirements have to be defined upfront, agreed upon, approved, and sent to another team before a solution can be integrated.
This is in stark contrast to classic developer workflows, where developers can work collaboratively on a pull request and iterate before settling on a solution.
What are semantic data contracts?
So, what's the alternative? Semantic data contracts! Think of them as agreements that capture both the structure (syntax) and the meaning (semantics) of your data, embedded directly into your data pipelines.
They allow you to define the specific rules and requirements unique to your business without the overhead and rigidity of governance APIs. Because they're managed as code and versioned in GitHub, they offer a flexible and adaptable approach.
Usual components
- Schema definition: structures, types, and relationships.
- Semantic definition: PII tags and industry-specific taxonomy.
- Data quality rules: value ranges, formats, and required fields.
- Access policies: which access groups may use the data and how, for example that devs cannot see PII.
- Business rules: custom rules specific to a company's processes, such as compliance requirements, data retention policies, or domain-specific validations.
- Usage docs: how the data should be processed or transformed.
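Putting these components together, a minimal contract might look like the sketch below. The layout, field names, and tags are illustrative assumptions rather than a fixed standard, and the YAML is embedded in Python here only so it can be parsed and inspected; in practice you would keep it as a file versioned next to your pipeline code.

```python
# A minimal sketch of a semantic data contract kept as YAML.
# All field names and tags are illustrative assumptions, not a fixed standard.
import yaml  # PyYAML

CONTRACT_YAML = """
dataset: customers
version: 1.2.0
owner: team-crm
columns:
  customer_id:
    type: string
    required: true
  email:
    type: string
    required: true
    semantics: [pii.email]          # semantic definition: PII tag
    quality:
      format: email                 # data quality rule
  lifetime_value:
    type: decimal
    quality:
      min: 0                        # value range
access:
  allowed_groups: [analytics, crm]  # access policy
  mask_pii_for: [developers]        # e.g. devs cannot see PII
business_rules:
  retention_days: 730               # data retention policy
usage: "Aggregate only; do not join with raw clickstream before anonymization."
"""

contract = yaml.safe_load(CONTRACT_YAML)
print(contract["columns"]["email"]["semantics"])  # ['pii.email']
```

Because the contract is just a file, a change to, say, retention_days is a pull request that a data engineer and a compliance officer can review together.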
Why semantic data contracts are better than APIs
Semantic data contracts vs. APIs is really a question of the right abstraction for business and data people. Do we expect these professionals to build APIs like developers who do microservices, or to work with YAML files in GitHub? The latter is much more fitting when we consider that both business and technical people can collaborate efficiently on a GitHub thread.
Additionally, with something like GitHub, versioning becomes a native process that enables safe migrations without introducing obstacles.
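To make that workflow concrete, a lightweight check like the one below could run in CI on every pull request that touches a contract, flagging breaking changes before they merge. It assumes the illustrative contract layout from the sketch above, and the rules it enforces are an assumption, not a standard.

```python
# Sketch of a CI check for contract changes: compares the previous and proposed
# versions of a contract file and flags breaking changes (illustrative rules only).
import sys
import yaml  # PyYAML


def breaking_changes(old: dict, new: dict) -> list[str]:
    problems = []
    old_cols, new_cols = old.get("columns", {}), new.get("columns", {})
    for name, spec in old_cols.items():
        if name not in new_cols:
            problems.append(f"column removed: {name}")
        elif spec.get("type") != new_cols[name].get("type"):
            problems.append(f"type changed for column: {name}")
    for name, spec in new_cols.items():
        if name not in old_cols and spec.get("required"):
            problems.append(f"new required column: {name}")
    return problems


if __name__ == "__main__":
    old_path, new_path = sys.argv[1], sys.argv[2]
    with open(old_path) as f_old, open(new_path) as f_new:
        old_contract, new_contract = yaml.safe_load(f_old), yaml.safe_load(f_new)
    issues = breaking_changes(old_contract, new_contract)
    same_major = old_contract["version"].split(".")[0] == new_contract["version"].split(".")[0]
    if issues and same_major:
        print("Breaking changes require a major version bump:", *issues, sep="\n- ")
        sys.exit(1)
    print("Contract change OK")
```

Because the check lives in the same repository as the contract, the conversation about a breaking change happens on the pull request itself, not in a ticket queue.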
dltHub’s role
We're working on developing and refining these concepts to make data governance a more efficient, automated process. The goal is to empower organizations to maintain high data quality and compliance without sacrificing agility. This proactive, simplified approach increases an organization's ability to innovate.
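As one concrete direction, dlt already lets you pin down how a schema may change at ingestion time through its schema_contract setting. The snippet below is a minimal sketch with a made-up resource and data; check the current dlt documentation for the exact modes available.

```python
# Minimal sketch: freezing the column set of a resource at ingestion time with dlt.
# The resource name and rows are made up; see dlt's docs for current contract modes.
import dlt


@dlt.resource(
    name="customers",
    schema_contract={"tables": "evolve", "columns": "freeze", "data_type": "freeze"},
)
def customers():
    # New columns or changed types in incoming data now violate the contract
    # instead of silently evolving the destination schema.
    yield {"customer_id": "c-1", "email": "a@example.com", "lifetime_value": 120.5}


pipeline = dlt.pipeline(pipeline_name="crm", destination="duckdb", dataset_name="crm_data")
pipeline.run(customers())
```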
Enabling portability of data catalogs
A sometimes overlooked part of governance is the metadata needed for access control. This metadata is often managed in the access layer, which ties data access to a specific catalog or access technology. If we wish to avoid vendor lock-in, this metadata is best maintained outside of catalogs. By managing it at ingestion through a semantic contract, we can deploy the same access governance across multiple access points.
This enables the portability required in building a portable data lake.
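As a sketch of what that portability can look like, the snippet below derives masking rules from the PII tags in the illustrative contract above and renders them as a view definition for one access point. The function, group names, and SQL shape are assumptions for illustration; a real setup would render the same metadata for whichever catalogs or engines you actually use.

```python
# Sketch: deriving access rules from contract metadata instead of maintaining them
# by hand inside each catalog. Column tags follow the illustrative contract above.
def masked_view_sql(table: str, columns: dict, viewer_group: str) -> str:
    """Render a view that hides PII-tagged columns from a given group (illustrative)."""
    select_parts = []
    for name, spec in columns.items():
        if any(tag.startswith("pii.") for tag in spec.get("semantics", [])):
            # PII-tagged columns are nulled out for this viewer group.
            select_parts.append(f"NULL AS {name}")
        else:
            select_parts.append(name)
    cols = ",\n  ".join(select_parts)
    return (
        f"CREATE OR REPLACE VIEW {table}_for_{viewer_group} AS\n"
        f"SELECT\n  {cols}\nFROM {table};"
    )


columns = {
    "customer_id": {"type": "string"},
    "email": {"type": "string", "semantics": ["pii.email"]},
    "lifetime_value": {"type": "decimal"},
}

# The same contract metadata can be rendered for any warehouse or query engine.
print(masked_view_sql("customers", columns, "developers"))
```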