
4 posts tagged with "dlthub"


· 5 min read
Matthaus Krzykowski

Celebrating over 500 ad hoc custom sources written by the dlt community in February

Today it is easier to pip install dlt and write a custom source than to set up and configure a traditional ETL platform.

The wider community is increasingly noticing these benefits. In February the community wrote over 500 dlt custom sources. Last week we crossed 2,000 total dlt custom sources created since we launched dlt last summer.

(Figure: custom sources)

A custom dlt source is something new for our industry. With dlt we have automated the majority of the tasks data engineers usually perform in traditional ETL platforms. Hence, creating an ad hoc dlt pipeline and source is dramatically simpler. Maintaining a custom dlt source in production is also relatively easy, as most of the common pipeline maintenance issues are handled for you.
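To give a sense of what such an ad hoc source can look like, here is a minimal sketch of a custom dlt source; the endpoint URL and resource name are hypothetical and stand in for whatever API you need.

```python
import dlt
import requests

@dlt.resource(name="issues", write_disposition="append")
def issues():
    # hypothetical REST endpoint; swap in the API you actually need
    response = requests.get("https://api.example.com/issues")
    response.raise_for_status()
    yield response.json()

pipeline = dlt.pipeline(
    pipeline_name="example_issues",
    destination="duckdb",
    dataset_name="issues_raw",
)

# dlt infers the schema and loads the records into the destination
load_info = pipeline.run(issues())
print(load_info)
```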

Today users pick dlt because it is the fastest way to create a dataset. As we frequently hear from all of you, “dlt is pip install and go”. This is in line with our mission to make this next generation of Python users autonomous when they create and use data in their organizations.

How to get to 50,000 sources: let’s remove the dependency on source catalogs and move toward ad hoc code

We think that “pip install ETLs” or “EL as code” tools such as dlt are ushering in a new era of ad hoc code. Ad hoc code allows for the automation and customization of very specific tasks.

Most of the market today has been educated by SaaS ETLs on the value of “source”/“connector” catalogs. The core is a short-tail catalog market of roughly 20 sources (product database replication, a few popular CRMs, and ads APIs) with the highest profit margins and intense competition among vendors. The long-tail source catalog market, depending on the vendor, covers up to around 400 sources with much weaker support.

We think that source catalogs will become more and more irrelevant in the era of LLMs and ad hoc code. “EL as code” lets users work independently of source catalogs. From the beginning, the dlt community has been writing wrappers for taps/connectors from other vendors, usually to migrate to a dlt pipeline at some point, as we documented in the customer story of how Harness adopted dlt.

(Figure: moving away from sources)

Even for short-tail, high-quality catalog sources, “EL as code” allows for fixing hidden gotchas and for the customisation that makes data pipelines production-ready.

We also believe that these are all early steps in “EL as code”. Hugging Face hosts over 116k datasets as of March ‘24. We at dltHub think that the ‘real’ Pythonic ETL market is a market of 100k APIs and millions of datasets.

dlt has been built for humans and LLMs from the get-go, and this will make coding data pipelines even faster

Since the inception of dlt, we have believed that its adoption among the next generation of Python users will depend on its compatibility with code generation tools, including Codex, ChatGPT, and any new tools that emerge on the market.

We have been building dlt not only for humans, but also for LLMs.

Back in March ‘23 we released dlt init as the simplest way to add a pipeline/initialize a project in dlt. We rebuilt the dlt library in such a way that it performs well with LLMs. At the end of May ‘23 we opened up our dltHub Slack to the broader community.

Back in June ‘23 we released a proof of concept of the 'dlt init' extension that can generate dlt pipelines from an OpenAPI specification. As we said at the time, if you build APIs, for example with FastAPI, the OpenAPI spec lets you automatically generate a Python client and hand it to your users. If you have 3 minutes, watch this Loom in which Marcin generates such a pipeline from the OpenAPI spec of the Pokemon API. This demo took things a step further and enables users to generate advanced dlt pipelines that, in essence, convert your API into a live dataset.

However, it takes a long time to go from an LLM PoC to production-grade code. We know much of our user base is already using ChatGPT and comparable tools to generate sources, and we hear our community's excitement about the promise of LLMs for this task. The automation promise applies both to building and to configuring pipelines. Anything seems possible, but if any of you have played around with this task in ChatGPT, you know the results are usually janky. Along these lines, over the last couple of months we have been dogfooding the PoC that generates dlt pipelines from an OpenAPI specification.


https://twitter.com/xatkit/status/1763973816798138370

You can read a case study here on how our solution engineer Violetta used an iterated version of the PoC to generate a production-grade Chargebee dlt source within hours instead of 2-3 days.

We think we are a few weeks away from releasing our next product, which will make coding data pipelines faster than renting a connector catalog: a dlt code generation tool that lets dlt users create datasets from REST APIs.

· 7 min read
William Laroche
info

TL;DR: William, a GCP data consultant, shares an article about the work he did with dlt and GCP to create a secure, scalable, lightweight, and powerful high-volume event ingestion engine.

He explores several alternatives before offering a solution, and he benchmarks the solution after a few weeks of running.

Read the original post here: dataroc.ca blog. Or find/hire William on LinkedIn.

In the ever-evolving landscape of cloud computing, optimizing data workflows is paramount for achieving efficiency and scalability. Even though Google Cloud Platform offers the powerful Dataflow service to process data at scale, sometimes the simplest solution is worth a shot.

In cases with a relatively high Pub/Sub volume (>10 messages per second), a pull subscription with a continuously running worker is more cost-efficient and quicker than a push subscription. Using a combination of Docker, Instance Templates and Instance Groups, it is pretty simple to set up an auto-scaling group of instances that will process Pub/Sub messages.
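To make the pull-worker pattern concrete, here is a minimal sketch using the google-cloud-pubsub synchronous pull API; the subscription path and batch size are placeholders, and the processing step is elided.

```python
from google.api_core.exceptions import DeadlineExceeded
from google.cloud import pubsub_v1

SUBSCRIPTION = "projects/my-project/subscriptions/events-pull"  # placeholder path

subscriber = pubsub_v1.SubscriberClient()

with subscriber:
    while True:
        try:
            # synchronously pull up to 500 messages per iteration
            response = subscriber.pull(
                request={"subscription": SUBSCRIPTION, "max_messages": 500},
                timeout=10,
            )
        except DeadlineExceeded:
            continue  # no messages arrived within the timeout window
        if not response.received_messages:
            continue
        # ... process the batch here (e.g. bulk insert into the database) ...
        subscriber.acknowledge(
            request={
                "subscription": SUBSCRIPTION,
                "ack_ids": [m.ack_id for m in response.received_messages],
            }
        )
```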

This guide will walk you through the process of configuring GCP infrastructure that efficiently pulls JSON messages from a Pub/Sub subscription, infers schema, and inserts them directly into a Cloud SQL PostgreSQL database using micro-batch processing.

The issue at hand

In my current role at WishRoll, I was faced with the issue of processing a high volume of events and storing them directly in the production database.

Imagine the scene: the server application produces analytics-style events such as "user logged-in", and "task X was completed" (among others). Eventually, for example, we want to run analytics queries on those events to count how many times a user logs in to better tailor their experience.

A. The trivial solution: synchronous insert

The trivial solution is to synchronously insert these events directly in the database. A simple implementation would mean that each event fired results in a single insert to the database. This comes with 2 main drawbacks:

  • Every API call that produces an event becomes slower, e.g. the /login endpoint now also needs to insert a record into the database
  • The database is now hit with a very high amount of insert queries

With our most basic need of 2 event types, we were looking at about 200 to 500 events per second. I concluded this solution would not be scalable. To make it so, 2 things would be necessary: (1) make the event firing mechanism asynchronous and (2) bulk events together before insertion.
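For the publishing side, here is a sketch of how (1) might look with the google-cloud-pubsub client; the topic path and event payload are hypothetical. publish() returns a future and batches messages in the background, so the API call no longer waits on a database insert; the bulking before insertion (2) happens in the pull worker described later.

```python
import json

from google.cloud import pubsub_v1

TOPIC = "projects/my-project/topics/analytics-events"  # placeholder path

publisher = pubsub_v1.PublisherClient()

def fire_event(event: dict) -> None:
    # publish() is non-blocking: it returns a future and the client
    # batches messages in the background before sending them
    publisher.publish(TOPIC, data=json.dumps(event).encode("utf-8"))

fire_event({"event_type": "user_logged_in", "user_id": 42})
```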

B. The serverless asynchronous solution

A second solution is to use a Pub/Sub push subscription to trigger an HTTP endpoint when a message comes in. This would've been easy in my case because we already have a worker-style autoscaled App Engine service that could've hosted this. However, this only solves the 1st problem of the trivial solution; the events still come in one at a time to the HTTP service.

Although it's possible to implement some sort of bulking mechanism in a push endpoint, it's much easier to have a worker pull many messages at once instead.

C. The serverless, fully-managed Dataflow solution

This led me to implement a complete streaming pipeline using GCP's streaming service: Dataflow. Spoiler: this was way overkill and led to weird bugs with DLT (data load tool). If you're curious, I've open-sourced that code too.

This solved both issues of the trivial solution, but proved pretty expensive and hard to debug and monitor.

D. An autoscaled asynchronous pull worker

Disclaimer: I had never considered standalone machines from cloud providers (AWS EC2, GCP Compute Engine) to be a viable solution to my cloud problems. In my head, they seemed like outdated, manually provisioned services that could instead be replaced by managed services.

But here I was, with a need to have a continuously running worker. I decided to bite the bullet and try my luck with GCP Compute Engine. What I realized to my surprise, is that by using instance templates and instance groups, you can easily set up a cluster of workers that will autoscale.

The code is simple: run a loop forever that pulls messages from a Pub/Sub subscription, bulk the messages together, and then insert them in the database. Repeat.

Then deploy that code as an instance group that auto-scales based on the need to process messages.

Code walkthrough

The complete source code is available here.

In summary, the code comprises 2 main parts:

  • The pulling and batching logic to accumulate and group messages from Pub/Sub based on their destination table
  • The load logic to infer the schema and bulk insert the records into the database. This part leverages DLT for destination compatibility and schema inference

Main loop

By using this micro-batch architecture, we strive to maintain a balance of database insert efficiency (by writing multiple records at a time) with near real-time insertion (by keeping the window size around 5 seconds).


```python
import dlt

# StreamingPull and the PUBSUB_INPUT_SUBCRIPTION, WINDOW_SIZE_SECS,
# DESTINATION_NAME and DATASET_NAME constants come from the project repository
pipeline = dlt.pipeline(
    pipeline_name="pubsub_dlt",
    destination=DESTINATION_NAME,
    dataset_name=DATASET_NAME,
)

# start pulling messages from the Pub/Sub subscription in the background
pull = StreamingPull(PUBSUB_INPUT_SUBCRIPTION)
pull.start()

try:
    while pull.is_running:
        # accumulate messages for up to WINDOW_SIZE_SECS seconds
        bundle = pull.bundle(timeout=WINDOW_SIZE_SECS)
        if len(bundle):
            # load the micro-batch and only ack the messages once loading succeeded
            load_info = pipeline.run(bundle.dlt_source())
            bundle.ack_bundle()
            # pretty print the information on data that was loaded
            print(load_info)
        else:
            print(f"No messages received in the last {WINDOW_SIZE_SECS} seconds")
finally:
    pull.stop()
```

How to deploy

The GitHub repo explains how to deploy the project as an instance group.

Database concerns

Using DLT has the major advantage of inferring the schema of your JSON data automatically. This also comes with some caveats:

  • The output schema of these analytics tables might change based on events
  • If your events have a lot of possible properties, the resulting tables could become very wide (lots of columns) which is not something desirable in an OLTP database

Given these caveats, I make sure that all events fired by our app are fully typed and limited in scope. Moreover, using the table_name_data_key configuration of the code I wrote, it's possible to separate different events with different schemas into different tables.

See this README section for an example of application code and the resulting table.
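As a rough, hypothetical illustration of the routing idea (the key name and payloads below are made up; the README above shows the real configuration):

```python
# with table_name_data_key pointed at a "table" field, these two events would be
# routed by the worker into separate tables with independently inferred schemas
events = [
    {"table": "user_logins", "user_id": 42, "logged_in_at": "2024-03-01T10:00:00Z"},
    {"table": "task_completions", "user_id": 42, "task_id": "X", "completed_at": "2024-03-01T10:05:00Z"},
]
```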

Performance and cost

After running this code and doing backfills for a couple of weeks, I was able to benchmark the overall efficiency and cost of this solution.

Throughput capacity

The pull worker performance

The Pub/Sub subscription metrics. Message throughput ranges between 200 and 300 per second, while the oldest message is usually between 5 and 8 seconds with occasional spikes.

I am running a preemptible (SPOT) instance group of n1-standard-1 machines that auto-scales between 2 and 10 instances. In normal operation, a single worker can handle our load easily. However, because of the preemptible nature of the instances, I set the minimum number to 2 to avoid periods where no worker is running.

Maximum capacity

When deploying the solution with a backlog of messages to process (15 hours worth of messages), 10 instances were spawned and cleared the backlog in about 25 minutes.

The Pub/Sub subscription throughput metrics when a 15-hour backlog was cleared. The instance group gradually reached 10 instances at about 10:30AM, then cleared the backlog by 10:50AM.

Between 7000 and 10000 messages per second were processed on average by these 10 instances, resulting in a minimum throughput capacity of 700 messages/s per worker.

Cost

Using n1-standard-1 spot machines, this cluster costs $8.03/mth per active machine. With a minimum cluster size of 2, this means $16.06 per month.

Conclusion

Using more "primitive" GCP services around Compute Engine provides a straightforward and cost-effective way to process a high throughput of Pub/Sub messages from a pull subscription.

info

PS from dlt team:

  • We just added data contracts, which let you manage schema evolution behavior (see the sketch below).
  • Are you on AWS? Check out this AWS SAM & Lambda event ingestion pipeline here.
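A minimal sketch of what using a data contract could look like, assuming dlt's schema_contract argument on pipeline.run; check the dlt docs for the exact modes and defaults.

```python
import dlt

pipeline = dlt.pipeline(
    pipeline_name="pubsub_dlt",
    destination="postgres",
    dataset_name="events",
)

events = [{"user_id": 42, "logged_in_at": "2024-03-01T10:00:00Z"}]  # hypothetical payload

# let new tables be created, discard values for unexpected new columns so the
# tables don't keep widening, and drop rows whose data types don't match
pipeline.run(
    events,
    table_name="user_logins",
    schema_contract={"tables": "evolve", "columns": "discard_value", "data_type": "discard_row"},
)
```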

· 3 min read
Matthaus Krzykowski

The number of Python developers increased from 7 million in 2017 to 15.7 million in Q1 2021 and grew by 3 million (20%) between Q4 2021 and Q1 2022 alone, making it the most popular programming language in Q3 2022. A large percentage of this new group are what we call Python practitioners: data folks and scripters. This group uses Python to do tasks in their jobs, but they do not consider themselves to be software engineers.

They are entering modern organizations en masse. Organizations often employ them for data-related jobs, especially in data engineering, data science / ML, and analytics. They must work with established data sources, data stores, and data pipelines that are essential to the business of these organizations. These companies, though, are not providing them with the type of tooling they have learned to expect. There’s no “Jupyter Notebook, pandas, NumPy, etc. for data loading” for them to use.

At this stage of dlt we are focused on serving the needs of organizations with 150 employees or less. Companies of this size typically begin making their first data hires. They want data to be at their core: their CEOs may want to make their companies more “data driven” and “user feedback centric”. Their CTOs may want to “build a data warehouse for automation and self service”. They frequently are eager to take advantage of the skills of the Python practitioners they have hired.

To achieve our mission of making this next generation of Python users autonomous in these organizations, we believe we need to build dlt in a “Pythonic” way. Anyone who can write a loop in a Python script should be able to write a source and load it. There should be a minimal learning curve. Anyone in these organizations who knows basic Python should be able to use dlt right away.

However, to fulfill our mission, we also recognize that dlt needs to be loved not only by Python users but also by data engineers. This is crucial because eventually these folks will be brought in to help with data loading in an organization. We need data engineers to evolve dlt pipelines rather than ripping them out and replacing them, as they almost always do today with scripts written by Python practitioners.

To develop with dlt, anyone can install it like any other Python library with pip install dlt. They can then run dlt init and be ready to go. Already today data engineers love the automatic schema inference and evolution as well as the customizability of dlt.
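As a rough sketch of how little code that means in practice (the sample data and table name here are hypothetical):

```python
import dlt

# any iterable of dicts works as a source: an API response, a CSV, a script's output
data = [
    {"id": 1, "name": "alice"},
    {"id": 2, "name": "bob"},
]

pipeline = dlt.pipeline(
    pipeline_name="quickstart",
    destination="duckdb",
    dataset_name="mydata",
)

# dlt infers the schema, creates the table, and loads the rows
load_info = pipeline.run(data, table_name="users")
print(load_info)
```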

· 3 min read
Matthaus Krzykowski

dltHub Mission

Since 2017, the number of Python users has been increasing by millions annually. The vast majority of these people leverage Python as a tool to solve problems at work. Our mission is to make this next generation of Python users autonomous when they create and use data in their organizations. For this end, we are building an open source Python library called data load tool (dlt).

These Python practitioners, as we call them, use dlt in their scripts to turn messy, unstructured data into regularly updated datasets. dlt empowers them to create highly scalable, easy to maintain, straightforward to deploy data pipelines without having to wait for help from a data engineer. When organizations eventually bring in data engineers to help with data loading, these engineers build on their work and evolve dlt pipelines.

We are dedicated to keeping dlt an open source project surrounded by a vibrant, engaged community. To make this sustainable, dltHub stewards dlt while also offering additional software and services that generate revenue (similar to what GitHub does with Git).

Why does dltHub exist?

We believe in a world where data loading becomes a commodity. A world where hundreds of thousands of pipelines will be created, shared, and deployed. A world where data sets, reports, and analytics will be written and shared publicly and privately.

To achieve our mission to make this next generation of Python users autonomous when they create and use data in their organizations, we need to address the requirements of both the Python practitioner and the data engineer with a minimal Python library. We also need dltHub to become the GitHub for data pipelines, facilitating and supporting the ecosystem of pipeline creators and maintainers as well as the other data folks who consume and analyze the data loaded.

There are lots of ETL/ELT tools available (300+!). Yet, as we engaged with Python practitioners over the last one and a half years, we found few Python practitioners that use traditional data ingestion tools. Only a handful have even heard of them. Very simplified, there are two approaches in traditional data ingestion tools, and neither works for this new generation: 1) SaaS solutions that handle the entire data loading process and 2) object-oriented frameworks for software engineers.

SaaS solutions do not give Python practitioners enough credit, while frameworks expect too much of them. In other words, there's no “Jupyter Notebook, pandas, NumPy, etc. for data loading” that meets users' needs. As millions of Python practitioners are now entering organizations every year, we think this should exist.

This demo works on Codespaces. Codespaces is a development environment available for free to anyone with a GitHub account. You'll be asked to fork the demo repository, and from there the README guides you through further steps.
The demo uses the Continue VSCode extension.

Off to codespaces!
