dltHub

We tackle the unsolved problem of Pythonic data ingestion in modern organizations

The rise of cloud and massively parallel processing (MPP) databases

The rise of cloud and massively parallel processing (MPP) databases began with foundational storage solutions like Amazon S3 and Microsoft Azure Blob Storage, which provided scalable, durable, and cost-effective storage for large datasets.

In 2013, AWS launched Redshift, a fully managed data warehouse service that leveraged S3's storage capabilities, offering powerful analytics without the need for extensive infrastructure. Google's BigQuery offered serverless analytics at similar scale, and Snowflake entered the scene with a cloud-native data platform whose architecture separates compute from storage, allowing flexible and efficient data management. Databricks, leveraging Apache Spark, brought advanced analytics and machine learning capabilities to the cloud, catering to a growing demand for data science and big data processing.


Companies are investing in the Modern Data Stack

These platforms provided the necessary infrastructure to store vast amounts of data and perform complex analytics efficiently and cost-effectively. The Modern Data Stack emerged as companies could now collect data from multiple sources, transform it into usable formats, and analyze it in near real-time, all within the cloud environment. It became the standard approach for organizations seeking to harness data for insights and innovation.

...and companies invested in the emerging Modern Data Stack.

ML/AI brings Python developers and their tooling into the enterprise

Driven by ML/AI, the number of Python developers has grown from 7 million in 2017 to 18.2 million in Q1 2024, and they are entering modern organizations en masse. Organizations often employ them for data-related roles, especially data engineering, machine learning, data science, and analytics. There, they must work with the established data sources, data stores, and data pipelines that are essential to the business.

Python practitioners have excellent libraries for number crunching (e.g. NumPy, pandas), model training (e.g. Hugging Face), and sharing results (e.g. Streamlit). They use Pythonic notebooks (such as Jupyter Notebooks or Google Colab) for intuitive and accessible data exploration and visualization.


The emergence of the Pythonic Composable Data Stack

We see the same pattern play out over and over again. The lack of proper tooling for data work is a severe problem for organizations. Python practitioners who want to use business data must either wait for a data engineer to help them load it or, more frequently, write a Python script themselves and connect it to the tools they have been trained on, such as Jupyter Notebooks. Even more frequently, expensive employees with rare skills (e.g. data engineers, MLOps engineers) are tasked with building internal data platforms out of custom Python scripts. These workflows are difficult to maintain and do not scale within organizations.

Increasingly, organizations are adopting the Pythonic Composable Data Stack. In the last few years, the Pythonic data lake has surged in popularity, driven by advancements in tools like Apache Arrow and DuckDB, which enable efficient in-memory processing and querying of large datasets. The integration of these tools with cloud storage and frameworks like PyArrow has made scalable data lakes accessible directly from Python. A new set of related tooling is emerging.
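As a minimal sketch of that pattern (the file name, columns, and query below are illustrative assumptions, not a specific stack), a Parquet file can be read into an Arrow table and queried in-process with DuckDB, with no server or cluster involved:

    import duckdb
    import pyarrow.parquet as pq

    # Read a Parquet file (e.g. synced from object storage) into an in-memory Arrow table.
    events = pq.read_table("events.parquet")  # placeholder file name

    # DuckDB can query the Arrow table directly by its Python variable name.
    result = duckdb.sql("SELECT user_id, count(*) AS n FROM events GROUP BY user_id")
    print(result.arrow())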


Why dlt

Before dlt, there was no "Jupyter Notebook, pandas, NumPy, Streamlit, or Hugging Face for data loading". There was nothing for robust data source extraction, schema modeling, and loading into a data store for further use and sharing.

Unlike non-Python solutions, dlt requires no backends or containers. We do not replace a company's data platform, deployments, or security models.

As a Python library, dlt can be used alongside existing Python scripts or ETL tooling.
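A minimal sketch (the pipeline name, destination, and sample records are illustrative assumptions): a script that already produces Python dictionaries can hand them to dlt, which infers the schema and loads them.

    import dlt

    # Records an existing script might already produce.
    rows = [
        {"id": 1, "email": "a@example.com"},
        {"id": 2, "email": "b@example.com"},
    ]

    # "duckdb" is just an example destination; other supported destinations work the same way.
    pipeline = dlt.pipeline(
        pipeline_name="example_pipeline",
        destination="duckdb",
        dataset_name="raw_data",
    )
    load_info = pipeline.run(rows, table_name="users")
    print(load_info)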

dlt works with both the Modern Data Stack and the Composable Data Stack

dlt acts as a bridge between the two worlds. It augments a company's existing investments in the Modern Data Stack. For example, we have great helpers for Airflow and dbt.
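As an illustrative sketch of the dbt helper (the destination and package path are placeholder assumptions; see the dlt docs for the exact helper API):

    import dlt

    pipeline = dlt.pipeline(
        pipeline_name="example_pipeline",
        destination="bigquery",  # placeholder destination
        dataset_name="raw_data",
    )

    # Run an existing dbt package on top of the data that dlt has loaded.
    dbt = dlt.dbt.package(pipeline, "path/to/dbt_package")  # placeholder package path
    models = dbt.run_all()
    print(models)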

dltHub is the user community growing around this core tool that bridges both worlds, a place where solutions, code, and advice are shared.

Eventually, we see dltHub becoming the GitHub for data pipelines: an ecosystem of pipeline creators and maintainers, as well as other data practitioners, where millions of code snippets will be created and shared. As a first step, we aim to launch a dedicated dltHub - a place where tens of thousands of code snippets are shared - before the end of 2024.

To solve data ingestion in the modern organization, dlt needs to be loved not only by Python practitioners but also by data engineers and companies building internal data platforms. dlt's data platform-in-a-box offers a set of products and services aimed at the needs of data platform teams.

Become a dltHub Technology partner

We are solving data ingestion not only for ourselves but also for the wider ecosystem.

dlt is a library. When our partners add a library to their code, it belongs to them. We want to be part of other libraries and products - whether open source or internal.

Dagster added dlt to its embedded ELT platform, and so did Prefect. dlt is under the hood of PostHog's data warehouse product.

For our partners, we offer insider resources, unique support, and certifications. For example, write helpers that make your company's tools and workflows more effective with our library (e.g. dbt, Airflow, Streamlit).

Become a dltHub Consulting partner

Starting in Q3 2024, we are partnering with data consultancies to ensure our customers get the most value from dlt.

Gain insider resources and unique opportunities while helping customers get the most from dlt.

Take advantage of partner-only training and enablement resources. Then help your team stand out by becoming certified dlt developers.

untitled data company worked with us as a design partner on the REST API toolkit, so they can build data pipelines for their consulting clients with very little time and code.
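As a hedged sketch of what such a toolkit makes possible (the base URL and resource names below are placeholders, not any client's actual pipelines), a short declarative configuration can replace most hand-written extraction code:

    import dlt
    from dlt.sources.rest_api import rest_api_source

    # Declarative description of an API: each endpoint becomes a resource/table.
    source = rest_api_source({
        "client": {"base_url": "https://api.example.com/v1/"},  # placeholder API
        "resources": ["customers", "invoices"],                 # placeholder endpoints
    })

    pipeline = dlt.pipeline(
        pipeline_name="example_rest_api",
        destination="duckdb",
        dataset_name="api_data",
    )
    pipeline.run(source)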

Work closely with our team on joint marketing activities, sales opportunities, product development, and joint solutions.

Built by engineers for engineers. Open source.