dltHub

We enable data teams to move data together

Why we are building dlt and dltHub

The proliferation of Python libraries and tools such as Pandas, NumPy, and PyTorch, together with Jupyter notebooks, revolutionized the ML/AI space by allowing millions of practitioners to actively build the ecosystem.

We aim to bring the same revolution to the data space. dlt is a pip-installable, minimalistic library that anyone writing Python code can use. It enables people to create new datasets and move them to the tools they use, whether other Python projects or engines and tooling from the Modern Data Stack.
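To make that concrete, a minimal pipeline can look like the sketch below. The pipeline, dataset, and table names are illustrative, not taken from this page; the only prerequisite is `pip install dlt`.

```python
# A minimal sketch of a dlt pipeline; "quickstart", "demo_data" and "players"
# are illustrative names chosen for this example.
import dlt

# Any iterable of Python dicts (or a generator, DataFrame, Arrow table, ...)
# can serve as the data source.
data = [
    {"id": 1, "name": "Ada"},
    {"id": 2, "name": "Grace"},
]

# Create a pipeline that loads into a local DuckDB file.
pipeline = dlt.pipeline(
    pipeline_name="quickstart",
    destination="duckdb",
    dataset_name="demo_data",
)

# dlt infers the schema, normalizes the data, and loads it.
load_info = pipeline.run(data, table_name="players")
print(load_info)
```

Swapping the destination string (for example to a cloud warehouse) is typically all it takes to move the same dataset somewhere else.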

We believe that dlt will enable millions of people working with data to create pipelines and datasets and to share them within organizations, with their customers, and between individuals.

dlt works with both the Modern Data Stack and the Composable Data Stack

dlt acts as a bridge between two worlds. It augments a company's existing tech investments in the Modern Data Stack; for example, we have strong support for Airflow, Dagster, and dbt. At the same time, it supports Pythonic projects such as Arrow, Pydantic, and Streamlit.
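The sketch below illustrates the bridge idea under stated assumptions: data prepared with Pythonic tooling (a pyarrow table) is loaded into a Modern Data Stack warehouse via dlt. The destination, dataset, and column names are assumptions for illustration; warehouse credentials would come from dlt's configuration rather than the code.

```python
# A hedged sketch: an in-memory Arrow table loaded into a warehouse with dlt.
import dlt
import pyarrow as pa

orders = pa.table(
    {
        "order_id": [1, 2, 3],
        "amount": [9.99, 24.50, 3.10],
    }
)

pipeline = dlt.pipeline(
    pipeline_name="bridge_demo",
    destination="bigquery",  # could equally be snowflake, redshift, duckdb, ...
    dataset_name="orders_raw",
)

# dlt accepts Arrow tables (and pandas DataFrames) directly in run().
pipeline.run(orders, table_name="orders")
```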

dltHub is the user community growing around this core tool that bridges both worlds, a place where solutions, code, and advice are shared. We think of dltHub eventually becoming the GitHub for data pipelines: an ecosystem of pipeline creators, maintainers, and other data folks, where millions of code snippets will be created and shared. As a first step, we aim to launch a dedicated dltHub, a place where tens of thousands of code snippets are shared, before the end of 2024.

To enable Pythonic data pipelines in modern organizations, dlt needs to be loved not only by Python practitioners but also by data engineers and companies building internal data platforms. dltHub's data platform-in-a-box offers a set of products and services aimed at the needs of data platform teams.

Image showing data collection and dltHub

Further background on why we are building dlt and dltHub

The rise of cloud and massively parallel processing (MPP) databases

The rise of cloud and massively parallel processing (MPP) databases began with foundational storage solutions like Amazon S3 and Microsoft Azure Blob Storage, which provided scalable, durable, and cost-effective storage for large datasets.

In 2013, AWS launched Redshift, a fully managed data warehouse service that leveraged S3's storage capabilities, offering powerful analytics without the need for extensive infrastructure. Google's BigQuery and Snowflake also rose to prominence, with Snowflake introducing a cloud-native data platform whose architecture separates compute and storage, allowing for flexible and efficient data management. Databricks, leveraging Apache Spark, brought advanced analytics and machine learning capabilities to the cloud, catering to a growing demand for data science and big data processing.

Image showing data storage and processing

Companies are investing in the Modern Data Stack

These platforms provided the necessary infrastructure to store vast amounts of data and perform complex analytics efficiently and cost-effectively. The Modern Data Stack emerged as companies could now collect data from multiple sources, transform it into usable formats, and analyze it in near real-time, all within the cloud environment. It became the standard approach for organizations seeking to harness data for insights and innovation.

Image showing data storage and processing

ML/AI brings Python developers and their tooling into the enterprise

Driven by ML/AI, the number of Python developers increased from 7 million in 2017 to 18.2 million in Q1 2024. They are entering modern organizations en masse. Organizations often employ them for data-related jobs, especially in data engineering, machine learning, data science, and analytics. There they must work with established data sources, data stores, and data pipelines that are essential to the business of these organizations.

Python practitioners have amazing libraries for number crunching (e.g. NumPy, pandas), model training (e.g. Hugging Face), and sharing results (e.g. Streamlit). They use Pythonic notebooks (such as Jupyter or Google Colab) for intuitive and accessible data exploration and visualization.

Image showing analytics and visualization tools

The emergence of the Pythonic Composable Data Stack

We continue to see the same pattern play out over and over again: the lack of proper tools for data jobs is a severe problem in modern organizations. Python practitioners who want to use business data must either wait for a data engineer to help them load it or, more frequently, write a Python script themselves and connect it to the tools they have been trained on, such as Jupyter Notebooks. Even more frequently, expensive employees with rare skills (e.g. data engineers, MLOps engineers) are tasked with creating internal data platforms based on custom Python scripts. These workflows are difficult to maintain and do not scale in organizations.

Increasingly, organizations are implementing the Pythonic Composable Data Stack. In the last few years, the Pythonic data lake has surged in popularity, driven by advances in projects such as Apache Arrow and DuckDB, which enable efficient in-memory processing and querying of large datasets. The integration of these tools with cloud storage solutions and bindings like PyArrow has made seamless, scalable data lakes accessible from Python. A new set of related tooling is emerging.
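As a small sketch of how composable this gets, data loaded by dlt can be queried directly with DuckDB using plain SQL. The file, schema, and table names below assume the earlier "quickstart" pipeline example and are illustrative.

```python
# Query a dlt-loaded dataset with DuckDB. Assumes the earlier sketch wrote a
# local "quickstart.duckdb" file with a "demo_data" schema and "players" table.
import duckdb

con = duckdb.connect("quickstart.duckdb")

# Plain SQL over the loaded dataset; the result comes back as a pyarrow.Table,
# ready for further Pythonic processing.
result = con.execute(
    "SELECT name, count(*) AS row_count FROM demo_data.players GROUP BY name"
).arrow()
print(result)
```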

Image showing analytics and visualization tools

How you can partner with us

Become a dltHub Technology partner

We are solving data ingestion not only for ourselves but also for the wider ecosystem.

dlt is a library. When our partners add a library to their code, it belongs to them. We want to be a part of other libraries, whether open-source libraries, internal libraries, or products.

Dagster added dlt to its embedded ELT platform, and so did Prefect. dlt is under the hood of PostHog's data warehouse product.

For our partners we offer insider resources, unique support, and certifications. For example, partners can write helpers that make their company's tools and workflows more effective with our library (e.g. dbt, Airflow, Streamlit).

Become a dltHub Consulting partner

Starting in Q3 2024, we are partnering with data consultancies to ensure our customers get the most value from dlt.

Gain insider resources and unique opportunities while helping customers get the most from dlt.

Take advantage of partner-only training and enablement resources. Then help your team stand out by becoming certified dlt developers.

untitled data company worked with us as a design partner on the REST API toolkit so that they can build data pipelines for their consulting clients in very little time and with very little code.
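A hedged sketch of how little code such a declarative REST API pipeline can take is shown below. The example API, resource names, and import path are assumptions based on dlt's public REST API source, not a description of the partner's actual work.

```python
# A hedged sketch of a declarative REST API pipeline with dlt; the example API
# (PokeAPI) and resource names are illustrative.
import dlt
from dlt.sources.rest_api import rest_api_source

source = rest_api_source(
    {
        "client": {"base_url": "https://pokeapi.co/api/v2/"},
        # Each resource becomes a table; pagination is handled by the source.
        "resources": ["pokemon", "berry"],
    }
)

pipeline = dlt.pipeline(
    pipeline_name="rest_api_demo",
    destination="duckdb",
    dataset_name="rest_api_data",
)

pipeline.run(source)
```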

Work closely with our team on joint marketing activities, sales opportunities, product development, and joint solutions.

Built by engineers for engineers. Open source.