Standardizing Ingestion and Its Metadata for Compliant Data Platforms
- Adrian Brudaru,
Co-Founder & CDO
Why the rush to be compliant?
As has always been true in the tech industry, each year brings more possibilities, use cases, and entropy. Recently, many regulations have been put in place to protect consumer data. Many companies have not yet taken steps to become compliant, or do only the minimum they must. This has left us with an accumulated compliance “technical debt” that is preventing us from moving further.
With the advent of LLMs, the potential for data breaches grows exponentially. Digital trails of data transfer and usage become permanent records - so companies cannot sweep lack of compliance under the rug either.
So companies now face a tough decision: govern your data properly, or be locked out of modern tooling and risk losing even the ability to leverage the data you already have.
So while we can all argue about the benefits of LLMs, one thing we can all agree on: proper data handling is vital to the business bottom line, and the consequences of getting it wrong can be catastrophic.
Now, onward to more positive topics - how do we solve this?
The research is in - it takes proactive action to remain competitive.
The Fraunhofer Institute and others have written articles on the topic. Their findings? To succeed in the post-compliance world, you need to be proactive.
Here are three strategies they highlight for success:
Proactive compliance strategy: Modern data platforms enable companies to integrate compliance measures from the outset, designing systems with privacy and security at their core. By proactively addressing potential data protection issues, companies can ensure that their data handling practices are in line with regulatory expectations and ready to adapt to any new requirements that may arise.
Data Minimization: Collecting only the data necessary for specific, legitimate purposes is becoming a best practice in modern data management. This approach not only aligns with GDPR’s principles but also reduces the risk and complexity associated with managing larger volumes of sensitive data. Modern data platforms can help enforce these principles by providing tools that ensure PII is stripped from curated data where it is not needed, enabling its broad usage. This highlights a need to curate data before ingestion, which the majority of off-the-shelf products do not currently tackle, and which also requires automation with humans in the loop, complicating things for SaaS vendors.
Regular training and awareness: Ensuring that all employees are regularly trained on GDPR requirements and best practices in data protection is essential. Modern data platforms can support these efforts by including user-friendly interfaces and clear guidelines on data handling processes. Regular training helps create a culture of data protection awareness throughout the organization, making compliance a shared responsibility among all employees.
Why standardize ingestion, and how can that help with governance?
Off-the-shelf governance solutions govern data at rest, such as tables or buckets. But that is not because it’s the best way to do it - it’s simply the only easy-to-sell solution, because it plugs into the one partial standard everyone shares: data at rest.
And data at rest is the only standard simply because we do not have metadata lying around disconnected from the data - we don’t have good source docs that tell us what data we may find, and we don’t have data contracts where the producer tells us what’s coming. Why not? Because, without clear mandates and solutions, this is a very hard problem to solve.
But what if we had some magic documentation about our data sources, before even filling any tables? What if we could profile source data based on source info, or in flight before loading? And what if we had a single way of doing it that’s not dependent on the storage solution?
We do, but let’s agree on the problem before jumping to solutions.
What is the problem?
Why is this problem at the forefront now, and what exactly is it? Ingestion? Governance? Scale?
Increasing data applications and the compliance burden
Data is proliferating at an unprecedented rate and so are the use cases. New sources, new capabilities, and a new wave of professionals are creating change faster than ever. Companies are forced to make at least some investments in this space if they are to remain afloat in an uncertain and competitive financial climate, or if they are going to be around to take advantage of prosperity cycles such as those created by ML, LLMs, AI.
This massive volume of data brings with it a significant compliance burden. Regulations such as GDPR in Europe, CCPA in California, and numerous other data protection laws worldwide require businesses to handle personal data responsibly. These laws mandate strict measures around data collection, processing, storage, and disposal, aiming to protect consumer privacy and prevent data breaches. However, the sheer scale of data generated makes it challenging for companies to keep track of all data points, much less manage them in compliance with all applicable laws, which often have nuanced and specific requirements.
If you think governance and metadata automation is needed to handle the new volumes effectively, you’re right.
Lack of standardization in data ingestion creates entropy downstream
The problems are compounded by the lack of standardization in data ingestion processes. Currently, practices in data ingestion are highly fragmented. This fragmentation typically results in inconsistencies in how data is collected, formatted, documented, and secured.
Without standardized practices, each dataset may be ingested without a clear understanding of its contents, origin, or sensitivity. This ad-hoc approach often leads to scenarios where data is not adequately documented, making it nearly impossible to ensure compliance throughout its lifecycle. Moreover, when data ingestion does not incorporate compliance and data governance from the start, businesses may find themselves retroactively trying to implement these measures, which is less effective and more costly.
Ineffectiveness of Data Contracts
Data contracts, intended to define and regulate the details of data capture, sharing, and usage between parties, often don't work as intended. One fundamental issue is the complexity involved in drafting these contracts. They require all parties to agree on numerous specific elements, such as data formats, usage rights, compliance standards, and security measures. This agreement process can be exceedingly difficult, especially in environments with multiple stakeholders having varying objectives and priorities.
Moreover, even when data contracts are in place, enforcing them can be equally challenging. Monitoring compliance and ensuring all parties adhere to agreed standards throughout the data lifecycle demands continuous oversight and resources, which many organizations find difficult to maintain.
Standardization is essential for enabling decentralization without losing the ability to operate cohesively
Data mesh, or in other words microservices for data, advocates for full decentralization. This could work if each company invented its own protocols and APIs and started enforcing and supporting decentralization - but few companies are in a position to pull it off.
For decentralization to allow teams the freedom to innovate and tailor their workflows, we must have shared protocols, such as through governance APIs. Standardization provides a foundational layer that supports decentralized operations, enabling different segments of an organization to work together efficiently. Adopting common standards helps in maintaining system integrity and operational effectiveness across the board, balancing flexibility with the necessary control to achieve broad organizational goals.
Lack of community standardization and interoperability
And yes, we need standards - but it seems they have to be reinvented every time.
Without universally accepted standards, each organization may implement its own unique set of rules and procedures. This lack of consistency leads to problems when integrating data from different sources, as each dataset may come with its own set of governance rules and formats. The resulting interoperability issues not only complicate data management but also increase the risk of non-compliance with broader regulatory requirements.
The difficulty of achieving consensus
Achieving consensus on data governance practices is inherently challenging, especially in sectors with numerous stakeholders or in cross-border operations where different legal frameworks and cultural norms about data privacy can clash. This difficulty is compounded in dynamic environments where data types and usage scenarios evolve rapidly, requiring frequent updates to governance agreements that stakeholders may be slow or reluctant to adopt.
The need for proactive solutions
Given these challenges, it's clear that if we want to solve the problem of data governance effectively, we can't rely solely on external solutions or hope that standard practices will emerge on their own. Instead, businesses need to take a proactive role in developing and implementing governance frameworks that work for their specific data ecosystems.
By actively engaging in the development of governance solutions and encouraging a culture of compliance and cooperation within the industry, businesses can significantly mitigate the risks associated with data management and unlock the full potential of their data assets.
Tool entropy at ingestion creates downstream problems
When companies use a mix of tools from SaaS vendors' connector catalogs and event ingestion solutions, it can lead to issues with data quality, lineage, and governance further down the line. Each tool might handle data differently, which can make it difficult to manage, track, and control the data effectively.
World, the time has come to… Standardize
The main reason the metadata problem is so hard to manage is the high entropy of data: many datasets, many levels of nesting, many columns. A stream of JSON documents might easily produce 3,000 nested columns - who’s going to clean that up?
Worse, if we ingest data with a diversity of tools, they introduce all kinds of other problems, such as breaking lineage from source to destination (proprietary ETLs) or exposing only shallow, inconsistent metadata. And when it comes to actually managing these flows, there is no central view or control plane from which to set policies or turn the data flows off.
So what we need is a good ingestion standard that can manage metadata well. Why ingestion?
Because moving left of that, you fall into data contracts, which are high-maintenance human agreements. Moving right of that, you’re exactly where we are now: ingesting anything and curating data at rest. Doing it at ingestion lets us reduce one level of tool and technical entropy without falling into the trap of human relationship entropy.
What could a perfect standard ingestion library look like?
Here are the key characteristics of a standard:
- Developer-focused tooling: Addresses the creator, not just the consumer. This stimulates a healthy open-source ecosystem in which users can contribute.
- Simplicity and ease of use: Developer-first design, a shallow learning curve and a simple API are key to successful onboarding and usage of a standard.
- Flexibility and customizability: A standard can’t be a one-trick pony; it needs to solve a sizable chunk of the problem space.
- Scalability. Enough said.
- Robustness and Reliability: Ensures reliable, fault-tolerant data ingestion with features like error handling, retries, atomic transaction support, and idempotence.
- Security and Compliance: Includes pseudonymisation, access controls, and auditing capabilities, complying with data protection regulations like GDPR and HIPAA.
- Extensibility: A standard must plug and play into its ecosystem, so the ecosystem must be enabled to run it and extend it.
- Community and Support: Strong community support and active development, providing a rich ecosystem of tutorials, documentation, and forums.
- Cost-Effectiveness: Offers a cost-effective solution, especially if open-source, reducing overall costs compared to proprietary systems.
- Documentation and Resources: Comprehensive, clear documentation with examples and guides to facilitate adoption and reduce developer effort.
Introducing the Pythonic ingestion standard - dlthub’s dlt (data load tool)
dlt was built out of a simple yet fundamental need: an accessible, universal data loading tool.
One that any team member can use effortlessly.
This need typically prompts mature companies to develop their own in-house solutions. However, dlt improves on these home-brewed frameworks because it is crafted by a dedicated team aiming for production-grade robustness from the outset, together with easy accessibility and onboarding for all.
Similar to how Confluent stewards Kafka, a standard in open-source stream ingestion, dlthub embraces the open-core model. This means dlt needs to become a standard for dlthub to succeed, aligning the company’s motivations with the needs of open-source users.
Key characteristics of dlt as a standard library
Developer-focused
dlt prioritizes the data engineer's needs, focusing on easy development along with control and capability rather than merely serving as a connector catalog for analysts. It's engineered to enhance functionality without replacing existing components like orchestrators or transformation layers. It encourages synergy with the best tools in each category, and native interoperability without the need for complexities like Docker containers.
Simplicity and Ease of Use
Shallow learning curve: you can just use dlt. In its simplest form, you can load Python data (dataframes, JSON, lists, dicts) to a destination with a single function call, as in the sketch below. For the data engineer, topics like incremental loading, memory management and parallelism are just a matter of configuration. dlt makes easy things easy and hard things possible.
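To make the “single function call” claim concrete, here is a minimal sketch; the pipeline name, dataset name and sample records are illustrative, and DuckDB is used as a local destination:

```python
import dlt

# sample in-memory data; this could just as well be a dataframe or a generator
users = [
    {"id": 1, "name": "Alice", "email": "alice@example.com"},
    {"id": 2, "name": "Bob", "address": {"city": "Berlin"}},  # nested data is handled too
]

# a pipeline bundles the destination, dataset and state; the names are placeholders
pipeline = dlt.pipeline(
    pipeline_name="quickstart",
    destination="duckdb",
    dataset_name="raw_users",
)

# one call infers the schema, normalizes nested fields and loads the data
load_info = pipeline.run(users, table_name="users")
print(load_info)
```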
Flexibility, Extensibility, Customizability
It’s very easy to build your own source or destination, or to customize any of dlt’s existing behaviors. All the code is open, in Python, and can be modified to your needs. Besides sources and destinations, there is already an ecosystem of integrations such as the dbt runner, the Streamlit client, and Dagster and Airflow integrations, and other tools such as PostHog use dlt under the hood.
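As a sketch of how little code a custom source takes - the endpoint URL and table name below are hypothetical, and any HTTP client would do:

```python
import dlt
import requests

# a custom resource is just a decorated generator that yields dicts or lists of dicts
@dlt.resource(table_name="issues", write_disposition="append")
def issues():
    response = requests.get("https://api.example.com/issues")  # hypothetical endpoint
    response.raise_for_status()
    yield response.json()

pipeline = dlt.pipeline(
    pipeline_name="custom_source_demo",
    destination="duckdb",
    dataset_name="tickets",
)
pipeline.run(issues())
```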
Scalability
dlt is scalable in many ways, from leveraging async and parallelism to using Arrow where possible. And since it is just a lightweight library, you can also massively parallelize any work it does externally, by running it on thousands of parallel serverless functions or hundreds of Airflow workers.
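As a rough sketch of parallelism being a configuration concern rather than a code change: dlt resolves settings from config files or environment variables. The variable names below assume the SECTION__KEY convention mapping onto the extract/normalize/load sections, and the worker counts are illustrative - check the performance docs for your version:

```python
import os
import dlt

# assumption: SECTION__KEY environment variables map onto dlt's config sections
os.environ["EXTRACT__WORKERS"] = "4"      # parallel extraction
os.environ["NORMALIZE__WORKERS"] = "4"    # parallel normalization of load packages
os.environ["LOAD__WORKERS"] = "8"         # parallel load jobs against the destination

pipeline = dlt.pipeline(pipeline_name="scaled", destination="duckdb", dataset_name="raw")
# pipeline.run(...) will now use the configured worker pools
```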
Resilient and Reliable
dlt ensures reliable, fault-tolerant data ingestion with features like error handling, retries, atomic and idempotent loads.
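For example, a primary key combined with a merge write disposition makes re-running a load idempotent: replaying the same records upserts rather than duplicates them. The table and fields below are illustrative:

```python
import dlt

# "merge" + primary_key: re-running this load upserts instead of duplicating rows
@dlt.resource(primary_key="id", write_disposition="merge")
def orders():
    yield [
        {"id": 1, "status": "shipped"},
        {"id": 2, "status": "pending"},
    ]

pipeline = dlt.pipeline(
    pipeline_name="reliable_load",
    destination="duckdb",
    dataset_name="shop",
)
pipeline.run(orders())  # safe to re-run: the destination converges to the same rows
```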
Secure and compliant
Besides the security benefits native to open source, dlt data schemas can be used to carry compliance tags, as sketched below.
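A minimal sketch of what that can look like: column-level hints travel with the versioned schema that dlt stores alongside the data, and a custom annotation (the “x-pii” flag below is a hypothetical tag, not a built-in dlt feature) can be read by downstream governance tooling to mask or restrict columns:

```python
import dlt

@dlt.resource(
    table_name="customers",
    # column hints are stored in the versioned schema next to the data;
    # "x-pii" is a hypothetical custom annotation for downstream governance tooling
    columns={"email": {"data_type": "text", "x-pii": True}},
)
def customers():
    yield [{"id": 1, "email": "alice@example.com"}]

pipeline = dlt.pipeline(pipeline_name="tagged", destination="duckdb", dataset_name="crm")
pipeline.run(customers())
```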
Community and Support
A strong community, Slack support, and active development provide a rich ecosystem of tutorials, documentation, and blog posts.
Cost-Effective
Running dlt yourself is often 50-200x cheaper than commercial SaaS. In some cases you can also configure dlt to choose the loading method that costs your destination the least, where by default dlt favors atomicity first.
Good documentation
Features comprehensive, clear documentation with examples and guides to facilitate adoption and reduce developer effort.
In conclusion, dlt is already suitable as a standard, and we are working to grow its ecosystem of users and integrations.
Beyond an open source standard, how can dlthub help?
As we mentioned, we are an open-core company and are working towards creating a commercial offering to accompany dlt. We are still in the early stages, so we are offering a mix of hands-on help and tooling to facilitate our learning.
Shift your team from data engineering to data platform engineering
dlt automates much of the work a data engineer would otherwise do, while lowering the barrier to entry for building and maintaining pipelines. With features like the declarative REST API connector (sketched below), schema evolution, and simple declarative incremental loading, data engineering teams can focus on enablement instead of fixing things. This aligns them with concepts like data platforms and data mesh, which focus on enabling domain experts to handle domain-related work.
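A sketch of what the declarative REST API connector with incremental loading looks like; the base URL, resource name and cursor field are assumptions for illustration, and the import path assumes a dlt version that ships the REST API source with the core library:

```python
import dlt
from dlt.sources.rest_api import rest_api_source

# endpoints and incremental cursors are described as configuration, not code;
# all URLs, names and fields here are hypothetical
source = rest_api_source({
    "client": {"base_url": "https://api.example.com/v1/"},
    "resources": [
        {
            "name": "tickets",
            "endpoint": {
                "path": "tickets",
                "params": {
                    # only request rows newer than the last seen cursor value
                    "updated_since": {
                        "type": "incremental",
                        "cursor_path": "updated_at",
                        "initial_value": "2024-01-01T00:00:00Z",
                    },
                },
            },
        },
    ],
})

pipeline = dlt.pipeline(pipeline_name="rest_demo", destination="duckdb", dataset_name="support")
pipeline.run(source)
```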
dlt’s OSS offering:
The open-source library contains everything you need to build standardized ingestion for data platforms, complete with scaling and metadata. Read more about it in these case studies:
Data platform setups:
- Harness https://dlthub.com/case-studies/harness
- Taktile: Case study
- Department of Education, New South Wales: Medium blog post
Replacing expensive ingestion:
- Yummy.eu: Replace 5tran for 182x cost savings and 10x speed gain
- Taktile: dlt on AWS Lambda to process millions of daily tracking events
- dlthub: Replacing Segment with a 20-100x cheaper solution.
- William Laroche (freelancer): Streaming Pub/Sub JSON to Cloud SQL PostgreSQL on GCP
dlt integrations
- dbt runners: dlt offers a core and a cloud runner (see the sketch after this list).
- Deployment helpers for orchestration: dlt offers runners for Airflow and GitHub Actions; Dagster offers a dlt runner too.
- OpenAPI/Swagger/FastAPI: dlt features a spec-to-pipeline generator that can create a pipeline from an OpenAPI specification.
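As a sketch of the dbt runner in use - the dbt package location is a placeholder for your own project, and running it requires the dbt extra to be installed:

```python
import dlt

pipeline = dlt.pipeline(pipeline_name="ingest_and_model", destination="duckdb", dataset_name="raw")
# ...run your dlt sources first, then transform the loaded data with dbt...

# dlt prepares an isolated virtual environment and passes its credentials to dbt
venv = dlt.dbt.get_venv(pipeline)
dbt = dlt.dbt.package(pipeline, "path/to/your/dbt_project", venv=venv)
models = dbt.run_all()
for m in models:
    print(f"{m.model_name}: {m.status} in {m.time}")
```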
dlthub’s commercial offering:
- Building a data platform? dlthub’s solutions engineering team can lend a helping hand with things like specialized runners or point-to-point integrations.
- Replacing expensive or old components? dlthub’s solutions team has extensive experience and can help.
- dbt package generator: With dlt’s dbt package generation utility, you can generate dbt models for a staging layer or star schemas on top of dlt pipelines, complete with tests and scaffolding, ready for customization to your business metric needs.
- Custom integrations with other systems such as specialized runners, tighter integrations with orchestrators, custom enterprise destinations and credential vaults.
Want to learn more? Contact our solutions engineering team through this form.
Emit governance-related metadata from anywhere to anywhere
A single pane of glass to rule all the pipelines is something many of us want. dlt runs anywhere, which often includes locations outside of a typical orchestrator. To keep a single plane of control, metadata emitted by dlt can be centralized, just like an orchestrator centralizes the data emitted by its workers.
Enable data mesh and other decentralized architectures by having a standard for the ingestion and metadata of data sources.
dlt’s OSS offering
- Capture all kinds of metadata and send it anywhere: schemas, schema migrations, row counts, runs and their timings, lineage information, and more.
- Send this data to Sentry, consume it in Python, or load it back into your database (see the sketch after this list).
- Annotate your schemas with your own taxonomy and use it downstream in your applications.
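A minimal sketch of capturing run metadata and loading it back into the warehouse so it can feed a central control plane; the table names are illustrative:

```python
import dlt

pipeline = dlt.pipeline(pipeline_name="governed", destination="duckdb", dataset_name="raw")

# the returned load info describes what was loaded: packages, tables, schema changes, timings
load_info = pipeline.run([{"id": 1, "value": 10}], table_name="events")
print(load_info)

# the last trace holds step-by-step run metadata (extract, normalize, load)
print(pipeline.last_trace)

# persist the run metadata next to the data to build a central view over all pipelines
pipeline.run([load_info], table_name="_load_info")
```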
dlthub’s commercial offering:
- Assistance with customizing which metadata is sent and where it goes, be that a custom solution, an orchestrator, or somewhere else.
- PII tagging, lineage and documentation
Want to learn more? Contact our solutions engineering team through this form.
In conclusion: standardize your ingestion!
- Get involved: Join our open source community to shape the future of standardized data ingestion.
- Begin with expertise: Get a helping hand from our solutions team!