
· 4 min read
Aman Gupta

Hello, I'm Aman Gupta. For the past eight years I worked in the structured world of civil engineering, but recently I have found myself captivated by data engineering. I used to know how to stack bricks and build structural pipelines; now I build data pipelines, an interest sparked by a workshop hosted by dlt.

info

dlt (data loading tool) is an open-source library that you can add to your Python scripts to load data from various and often messy data sources into well-structured, live datasets.

The dlt workshop took place in November 2022, co-hosted by Adrian Brudaru, my former mentor and co-founder of dlt.

An opportunity arose when another client needed data migrated from Freshdesk to BigQuery. I crafted a basic pipeline, initially designed to support only my use case. When I presented it to the dlt team, Alena Astrakhatseva, a team member, generously offered to review it and refine it into a community-verified source.


My first iteration was straightforward—loading data in replace mode. While adequate for initial purposes, a verified source demanded features like pagination and incremental loading. To achieve this, I developed an API client tailored for the Freshdesk API, integrating rate limit handling and pagination:

from typing import Any, Iterable, Optional

import requests
from dlt.common.typing import TDataItem


class FreshdeskClient:
    """
    Client for making authenticated requests to the Freshdesk API.
    It incorporates rate limit handling and pagination into API requests.
    """

    def __init__(self, api_key: str, domain: str):
        # Stores the domain, credentials, and base URL.
        pass

    def _request_with_rate_limit(self, url: str, **kwargs: Any) -> requests.Response:
        # Handles rate limits in HTTP requests and ensures that the client
        # doesn't exceed the limit set by the server.
        pass

    def paginated_response(
        self,
        endpoint: str,
        per_page: int,
        updated_at: Optional[str] = None,
    ) -> Iterable[TDataItem]:
        # Fetches a paginated response from a specified endpoint.
        pass

To make the pipeline more effective, I developed dlt resources that could handle incremental data loading. This involved creating resources that use dlt's incremental functionality to fetch only new or updated data:

from typing import Any, Dict, Generator, Optional

import dlt


def incremental_resource(
    endpoint: str,
    updated_at: Optional[Any] = dlt.sources.incremental(
        "updated_at", initial_value="2022-01-01T00:00:00Z"
    ),
) -> Generator[Dict[Any, Any], Any, None]:
    """
    Fetches and yields paginated data from a specified API endpoint.
    Each page of data is fetched based on the `updated_at` timestamp
    to ensure incremental loading.
    """

    # Retrieve the last updated timestamp to fetch only new or updated records.
    updated_at = updated_at.last_value

    # Use the FreshdeskClient instance to fetch paginated responses.
    # `freshdesk` (a FreshdeskClient) and `per_page` are defined in the
    # enclosing source function.
    yield from freshdesk.paginated_response(
        endpoint=endpoint,
        per_page=per_page,
        updated_at=updated_at,
    )

With the steps defined above, I was able to load the data from Freshdesk to BigQuery and use the pipeline in production. Here’s a summary of the steps I followed:

  1. Created a Freshdesk API token with sufficient privileges.
  2. Created an API client to make requests to the Freshdesk API with rate limit and pagination.
  3. Made incremental requests to this client based on the “updated_at” field in the response.
  4. Ran the pipeline using the Python script.
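
To tie these steps together, a minimal run script could look like the sketch below. It assumes the incremental resources above are grouped into a dlt source - here called freshdesk_source, which is an assumption about naming; check the verified source for the exact interface.

import dlt

# Assumed import: a source function wrapping the incremental resources above
from freshdesk import freshdesk_source

# A pipeline that loads into BigQuery (names are illustrative)
pipeline = dlt.pipeline(
    pipeline_name="freshdesk",
    destination="bigquery",
    dataset_name="freshdesk_data",
)

# On repeated runs, only records updated since the last load are fetched
load_info = pipeline.run(freshdesk_source())
print(load_info)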

While my journey from civil engineering to data engineering was initially intimidating, it has proved to be a profound learning experience. Writing a pipeline with dlt mirrors the simplicity of a GET request: you request data, yield it, and it flows from the source to its destination. Now, I help other clients integrate dlt to streamline their data workflows, which has been an invaluable part of my professional growth.

In conclusion, diving into data engineering has expanded my technical skill set and provided a new lens through which I view challenges and solutions. A couple of years back, that lens mainly saw concrete and steel; now it has begun to notice the pipelines of the data world.

Data engineering has so far proved challenging, satisfying, and a good career option for me. For those interested in the detailed workings of these pipelines, I encourage exploring dlt's GitHub repository or diving into the documentation.

· 9 min read
Adrian Brudaru

What is the REST API Source toolkit?

tip

tl;dr: You are probably familiar with REST APIs.

  • Our new REST API Source is a short, declarative configuration driven way of creating sources.
  • Our new REST API Client is a collection of Python helpers used by the above source, which you can also use as a standalone, config-free, imperative high-level abstraction for building pipelines.

Want to skip to docs? Links at the bottom of the post.

Why REST configuration pipeline? Obviously, we need one!

But of course! Why write all this request and loading code over and over, when we could write it once and re-use it across different APIs with different configs?

Once you have built a few pipelines from REST APIs, you recognise that instead of writing code, we could just write configuration.

We can call such an obvious next step in ETL tools a “focal point” of “convergent evolution”.

And if you’ve been in a few larger more mature companies, you will have seen a variety of home-grown solutions that look similar. You might also have seen such solutions as commercial products or offerings.

But ours will be better…

So far we have seen many REST API configurators and products — they suffer from predictable flaws:

  • Local homebrewed flavors are local for a reason: they aren't suitable for a broad audience. And if you ask the users/beneficiaries of these frameworks, they will sometimes argue that they aren't suitable for anyone at all.
  • Commercial products are yet another data product that doesn’t plug into your stack, brings black boxes and removes autonomy, so they simply aren’t an acceptable solution in many cases.

So how can dlt do better?

Because it can keep the best of both worlds: the autonomy of a library, the quality of a commercial product.

As you will see further on, we created not just a standalone "configuration-based source builder" but also expose the underlying REST API Client, enabling its use directly in code.

Hey community, you made us do it!

The push for this is coming from you, the community. While we had considered the concept before, there were many things dlt needed before creating a new way to build pipelines. A declarative extractor, after all, would not make dlt easier to adopt, because a declarative approach requires more upfront knowledge.

Credits:

  • So, thank you Alex Butler for building a first version of this and donating it to us back in August ‘23: https://github.com/dlt-hub/dlt-init-openapi/pull/2.
  • And thank you Francesco Mucio and Willi Müller for re-opening the topic, and creating video tutorials.
  • And last but not least, thank you to dlt team’s Anton Burnashev (also known for gspread library) for building it out!

The outcome? Two Python-only interfaces, one declarative, one imperative.

  • dlt’s REST API Source is a Python dictionary-first declarative source builder, that has enhanced flexibility, supports callable passes, native config validations via python dictionaries, and composability directly in your scripts. It enables generating sources dynamically during runtime, enabling straightforward, manual or automated workflows for adapting sources to changes.
  • dlt’s REST API Client is the low-level abstraction that powers the REST API Source. You can use it in your imperative code for more automation and brevity, if you do not wish to use the higher level declarative interface.

Useful for those who frequently build new pipelines

If you are on a team with 2-3 pipelines that never change much, you likely won't see much benefit from our latest tool. What we observe from early feedback is that a declarative extractor shines at enabling easier work at scale. We heard excitement about the REST API Source from:

  • companies with many pipelines that frequently create new pipelines,
  • data platform teams,
  • freelancers and agencies,
  • folks who want to generate pipelines with LLMs and need a simple interface.

How to use the REST API Source?

Since this is a declarative interface, we can’t make things up as we go along, and instead need to understand what we want to do upfront and declare that.

In some cases, we might not have the information upfront, so we will show you how to get that info during your development workflow.

Depending on how you learn better, you can either watch the videos that our community members made, or follow the walkthrough below.

Video walkthroughs

In these videos, you will learn at a leisurely pace how to use the new interface. Playlist link.

Workflow walkthrough: Step by step

If you prefer to do things at your own pace, try the workflow walkthrough, which will show you the workflow of using the declarative interface.

In the example below, we will show how to create an API integration with 2 endpoints. One of these is a child resource, using the data from the parent endpoint to make a new request.

Configuration Checklist: Before getting started

In the following, we will use the GitHub API as an example.

We will also provide links to examples from this Google Colab tutorial.

  1. Collect your API URL and endpoints, Colab example:

    • A URL is the base of the request, for example: https://api.github.com/.
    • An endpoint is the path of an individual resource such as:
      • /repos/{OWNER}/{REPO}/issues;
      • or /repos/{OWNER}/{REPO}/issues/{issue_number}/comments which would require the issue number from the above endpoint;
      • or /users/{username}/starred etc.
  2. Identify the authentication methods, Colab example:

  3. Identify whether you have any dependent request patterns, such as first getting ids in a list, then using each id to request details. For GitHub, we might do the below or any other dependent requests. Colab example:

    1. Get all repos of an org https://api.github.com/orgs/{org}/repos.
    2. Then get all contributors https://api.github.com/repos/{owner}/{repo}/contributors.
  4. How does pagination work? Is there any? Do we know the exact pattern? Colab example.

    • On GitHub, we have consistent pagination between endpoints that looks like this: link_header = response.headers.get('Link', None).
  5. Identify the necessary information for incremental loading, Colab example:

    • Will any endpoints be loaded incrementally?
    • What columns will you use for incremental extraction and loading?
    • GitHub example: We can extract new issues by requesting issues after a particular time: https://api.github.com/repos/{repo_owner}/{repo_name}/issues?since={since}.

Configuration Checklist: Checking responses during development

  1. Data path:
  2. Unless you had full documentation at point 4 (which we did), you likely still need to figure out some details of how pagination works.
    1. To do that, we suggest using curl or a second Python script to make a request and inspect the response; this gives you the flexibility to try anything (a minimal sketch follows this list). Colab example.
    2. Or you could print the source as above - but if there is metadata in headers etc., you might miss it.
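
As a concrete illustration of that inspection step, here is a minimal sketch using the GitHub endpoint from the checklist (the endpoint and params are just examples):

import requests

# Make a single exploratory request against the endpoint you plan to configure
response = requests.get(
    "https://api.github.com/repos/dlt-hub/dlt/issues",
    params={"per_page": 2},
)

# Inspect the payload shape to find the data path
print(response.json()[:1])

# Inspect the headers for pagination hints, e.g. GitHub's Link header
print(response.headers.get("Link", None))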

Applying the configuration

Here’s what a configured example could look like:

  1. Base URL and endpoints.
  2. Authentication.
  3. Pagination.
  4. Incremental configuration.
  5. Dependent resource (child) configuration.

If you are using a narrow screen, scroll the snippet below to look for the numbers designating each component (n).

# This source has 2 resources:
# - issues: Parent resource, retrieves issues incl. issue number
# - issue_comments: Child resource which needs the issue number from the parent.

import os
from rest_api import RESTAPIConfig

github_config: RESTAPIConfig = {
    "client": {
        "base_url": "https://api.github.com/repos/dlt-hub/dlt/",  #(1)
        # Optional auth for improving rate limits
        # "auth": {  #(2)
        #     "token": os.environ.get('GITHUB_TOKEN'),
        # },
    },
    # The paginator is autodetected, but we can pass it explicitly  #(3)
    # "paginator": {
    #     "type": "header_link",
    #     "next_url_path": "paging.link",
    # }
    # We can declare generic settings in one place.
    # Our data is stateful, so we load it incrementally by merging on id.
    "resource_defaults": {
        "primary_key": "id",  #(4)
        "write_disposition": "merge",  #(4)
        # These are request params specific to GitHub
        "endpoint": {
            "params": {
                "per_page": 10,
            },
        },
    },
    "resources": [
        # This is the first resource - issues
        {
            "name": "issues",
            "endpoint": {
                "path": "issues",  #(1)
                "params": {
                    "sort": "updated",
                    "direction": "desc",
                    "state": "open",
                    "since": {
                        "type": "incremental",  #(4)
                        "cursor_path": "updated_at",  #(4)
                        "initial_value": "2024-01-25T11:21:28Z",  #(4)
                    },
                },
            },
        },
        # Configuration for fetching comments on issues  #(5)
        # This is a child resource - as in, it needs something from another resource.
        {
            "name": "issue_comments",
            "endpoint": {
                "path": "issues/{issue_number}/comments",  #(1)
                # For child resources, you can use values from the parent resource for params.
                "params": {
                    "issue_number": {
                        # Use type "resolve" to define a child endpoint which should be resolved
                        "type": "resolve",
                        # Parent resource
                        "resource": "issues",
                        # The specific field in the issues resource to use for resolution
                        "field": "number",
                    },
                },
            },
            # A list of fields from the parent resource which will be included in the child resource output.
            "include_from_parent": ["id"],
        },
    ],
}
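
To actually run this configuration, the source builder turns the dictionary into a regular dlt source. A minimal sketch, assuming rest_api_source can be imported from the same rest_api module as RESTAPIConfig (see the docs linked in this post for the exact import path):

import dlt
from rest_api import rest_api_source

# Build a dlt source from the declarative config above
github_source = rest_api_source(github_config)

# Load it into DuckDB (or any other dlt destination)
pipeline = dlt.pipeline(
    pipeline_name="github_rest_api",
    destination="duckdb",
    dataset_name="github_data",
)
load_info = pipeline.run(github_source)
print(load_info)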

And that’s a wrap — what else should you know?

  • As we mentioned, there’s also a REST Client - an imperative way to use the same abstractions, for example, the auto-paginator - check out this runnable snippet:

    from dlt.sources.helpers.rest_client import RESTClient

    # Initialize the RESTClient with the Pokémon API base URL
    client = RESTClient(base_url="https://pokeapi.co/api/v2")

    # Using the paginate method to automatically handle pagination
    for page in client.paginate("/pokemon"):
        print(page)
  • We are going to generate a bunch of sources from OpenAPI specs — stay tuned for an update in a couple of weeks!

Next steps

· 12 min read
Anuun Chinbat

It's been nearly half a century since cron was first introduced, and now we have a handful of orchestration tools that go way beyond just scheduling tasks. With data folks constantly debating which tools are top-notch and which ones should leave the scene, it's like we're at a turning point in the evolution of these tools. By that I mean the term 'orchestrator' has become kind of a catch-all, and that's causing some confusion because we're using this one word to talk about a bunch of different things.

dates

Think about the word “date.” It can mean a fruit, a romantic outing, or a day on the calendar, right? We usually figure out which one it is from the context, but what does context mean when it comes to orchestration? It might sound like a simple question, but it's pretty important to get this straight.

And here's a funny thing: some people, after eating an odd-tasting date (the fruit, of course), are so put off that they naively swear off going on romantic dates altogether. It's an overly exaggerated figurative way of looking at it, but it shows how one bad experience can color our view of something completely different. That's kind of what's happening with orchestration tools. If someone had a bad time with one tool, they might be overly critical towards another, even though it might be a totally different experience.

So the context in terms of orchestration tools seems to be primarily defined by one thing - WHEN a specific tool was first introduced to the market (aside from the obvious factors like the technical background of the person discussing these tools and their tendency to be a chronic complainer 🙄).


IT'S ALL ABOUT TIMING!

evolution-of-data-orchestration

The Illegitimate Child

Cron was initially released in 1975 and is undoubtedly the father of all scheduling tools, including orchestrators, but I’m assuming Cron didn’t anticipate this many offspring in the field of data (or perhaps it did). As Oracle brought the first commercial relational database to market in 1979, people started to realize that data needs to be moved on schedule, and without manual effort. And it was doable, with the help of Control-M, though it was more of a general workflow automation tool that didn’t pay special attention to data workflows.

Basically, since the solutions weren’t data driven at that time, it was more “The job gets done, but without a guarantee of data quality.”

Finally Adopted

Unlike Control-M, Informatica was designed with data operations in mind from the beginning. As data started to spread across entire companies, advanced OLAP systems started to emerge alongside the broad use of data warehousing. Now data not only needed to be moved, but integrated across many systems and users. The data orchestration solution from Informatica was inevitably influenced by the rising popularity of the drag-and-drop concept of the time - to the detriment of many modern data engineers, who would recommend skipping Informatica and other GUI-based ETL tools that offer 'visual programming'.

As the creator of Airflow, Max Beauchemin, said: "There's a multitude of reasons why complex pieces of software are not developed using drag and drop tools: it's that ultimately code is the best abstraction there is for software."

To Be Free, That Is, Diverse

With traditional ETL tools such as IBM DataStage and Talend becoming well-established in the 1990s and early 2000s, the big data movement started gaining momentum with Hadoop as the main star. Oozie, later made open-source in 2011, was tasked with workflow scheduling of Hadoop jobs, while closed-source solutions like K2View started to operate behind the curtains.

Fast forward a bit, and the scene exploded, with Airflow quickly becoming the heavyweight champ, while every big data service out there began rolling out their own orchestrators. This burst brought diversity, but with diversity came a maze of complexity. All of a sudden, there's an orchestrator for everyone — whether you're chasing features or just trying to make your budget work 👀 — and picking the perfect one for your needs has gotten even trickier.

types

The Bottom Line

The thing is that every tool out there has some inconvenient truths, and the real question isn't about escaping the headache — it's about choosing your type of headache. Hence, the endless sea of “versus” articles, blog posts, and guides trying to help you pick your personal battle.

A Redditor: "Everyone has hated all orchestration tools for all time. People just hated Airflow less and it took off."

What I'm getting at is this: we're all a bit biased by the "law of the instrument." You know, the whole “If all you have is a hammer, everything looks like a nail” thing. Most engineers probably grabbed the latest or most hyped tool when they first dipped their toes into data orchestration and have stuck with it ever since. Sure, Airflow is the belle of the ball for the community, but there's a whole lineup of contenders vying for the spotlight.

law-of-instrument

And there are obviously those who would relate to the following:

reddit-screenshot


A HANDY DETOUR POUR TOI 💐

The Fundamentals

About Airflow

Miscellaneous


WHAT THE FUTURE HOLDS...

I'm no oracle or tech guru, but it's pretty obvious that at their core, most data orchestration tools are pretty similar. They're like building blocks that can be put together in different ways—some features come, some go, and users are always learning something new or dropping something old. So, what's really going to make a difference down the line is NOT just about having the coolest features. It's more about having a strong community that's all in on making the product better, a welcoming onboarding process that doesn't feel like rocket science, and finding that sweet spot between making things simple to use and letting users tweak things just the way they like.

In other words, it's not just about what the tools can do, but how people feel about using them, learning them, contributing to them, and obviously how much they spend to maintain them. That's likely where the future winners in the data orchestration game will stand out. But don’t get me wrong, features are important — it's just that there are other things equally important.


I’ve been working on this article for a WHILE now, and, honestly, it's been a bit of a headache trying to gather any solid, objective info on which data orchestration tool tops the charts. The more I think about it, the more I realise it's probably because trying to measure "the best" or "most popular" is a bit like trying to catch smoke with your bare hands — pretty subjective by nature. Plus, only testing them with non-production level data probably wasn't my brightest move.

However, I did create a fun little project where I analysed the sentiment of comments on articles about selected data orchestrators on Hacker News and gathered Google Trends data for the past year.

Just a heads-up, though: the results are BY NO MEANS reliable and are skewed due to some fun with words. For instance, searching for “Prefect” kept leading me to articles about Japanese prefectures, “Keboola” resulted in Kool-Aid content, and “Luigi”... well, let’s just say I ran into Mario’s brother more than once 😂.


THE FUN LITTLE PROJECT

Straight to the GitHub repo.

I used Dagster and dlt to load data into Snowflake, and since both of them have integrations with Snowflake, it was easy to set things up and have them all running:

Pipeline overview

This project is very minimal, including just what's needed to run Dagster locally with dlt. Here's a quick breakdown of the repo’s structure:

  1. .dlt: Utilized by the dlt library for storing configuration and sensitive information. The Dagster project is set up to fetch secret values from this directory as well.
  2. charts: Used to store chart images generated by assets.
  3. dlt_dagster_snowflake_demo: Your Dagster package, comprising Dagster assets, dlt resources, Dagster resources, and general project configurations.

Dagster Resources Explained

In the resources folder, the following two Dagster resources are defined as classes:

  1. DltPipeline: This is our dlt object defined as a Dagster ConfigurableResource that creates and runs a dlt pipeline with the specified data and table name. It will later be used in our Dagster assets to load data into Snowflake.

    import dlt
    from dagster import ConfigurableResource

    class DltPipeline(ConfigurableResource):
        # Initialize resource with pipeline details
        pipeline_name: str
        dataset_name: str
        destination: str

        def create_pipeline(self, resource_data, table_name):
            """
            Creates and runs a dlt pipeline with the specified data and table name.

            Args:
                resource_data: The data to be processed by the pipeline.
                table_name: The name of the table where data will be loaded.

            Returns:
                The result of the pipeline execution.
            """

            # Configure the dlt pipeline with your destination details
            pipeline = dlt.pipeline(
                pipeline_name=self.pipeline_name,
                destination=self.destination,
                dataset_name=self.dataset_name,
            )

            # Run the pipeline with your parameters
            load_info = pipeline.run(resource_data, table_name=table_name)
            return load_info
  2. LocalFileStorage: Manages the local file storage, ensuring the storage directory exists and allowing data to be written to files within it. It will be later used in our Dagster assets to save images into the charts folder.
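
    As a rough illustration, such a resource could look like the minimal sketch below; the exact implementation in the repo may differ, and the field and method names here are only assumptions:

    import os

    from dagster import ConfigurableResource

    class LocalFileStorage(ConfigurableResource):
        # Directory where chart images will be written (illustrative field name)
        dir: str

        def setup_for_execution(self, context) -> None:
            # Make sure the storage directory exists before any asset writes to it
            os.makedirs(self.dir, exist_ok=True)

        def write(self, filename: str, data: bytes) -> None:
            # Write binary data (e.g. a chart image) into the storage directory
            with open(os.path.join(self.dir, filename), "wb") as f:
                f.write(data)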

dlt Explained

In the dlt folder within dlt_dagster_snowflake_demo, necessary dlt resources and sources are defined. Below is a visual representation illustrating the functionality of dlt:

dlt explained

  1. hacker_news: A dlt resource that yields stories related to specified orchestration tools from Hackernews. For each tool, it retrieves the top 5 stories that have at least one comment. The stories are then appended to the existing data.

    Note that the write_disposition can also be set to merge or replace:

    • The merge write disposition merges the new data from the resource with the existing data at the destination. It requires a primary_key to be specified for the resource. More details can be found here.
    • The replace write disposition replaces the data in the destination with the data from the resource. It deletes the existing data and recreates the schema before loading the data.

    More details can be found here.

  2. comments: A dlt transformer - a resource that receives data from another resource. It fetches comments for each story yielded by the hacker_news function.

  3. hacker_news_full: A dlt source that extracts data from the source location using one or more resource components, such as hacker_news and comments. To illustrate, if the source is a database, a resource corresponds to a table within that database.

  4. google_trends: A dlt resource that fetches Google Trends data for specified orchestration tools. It attempts to retrieve the data multiple times in case of failures or empty responses. The retrieved data is then appended to the existing data.

As you may have noticed, the dlt library is designed to handle the unnesting of data internally. When you retrieve data from APIs like Hacker News or Google Trends, dlt automatically unpacks the nested structures into relational tables, creating and linking child and parent tables. This is achieved through unique identifiers (_dlt_id and _dlt_parent_id) that link child tables to specific rows in the parent table. However, it's important to note that you have control over how this unnesting is done.
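
As a quick illustration of that unnesting, here's a hedged, self-contained sketch (the data is made up; table names follow dlt's default parent__child naming):

import dlt

@dlt.resource(name="stories", write_disposition="append")
def stories():
    # One story with a nested list of comments
    yield {
        "id": 1,
        "title": "Example story",
        "comments": [
            {"author": "alice", "text": "nice"},
            {"author": "bob", "text": "+1"},
        ],
    }

pipeline = dlt.pipeline(
    pipeline_name="unnesting_demo", destination="duckdb", dataset_name="demo"
)
pipeline.run(stories())

# dlt creates a `stories` table plus a `stories__comments` child table,
# linked via _dlt_id on the parent and _dlt_parent_id on the child.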

The Results

Alright, so once you've got your Dagster assets all materialized and data loaded into Snowflake, let's take a peek at what you might see:

sentiment counts

I understand if you're scratching your head at first glance, but let me clear things up. Remember those sneaky issues I mentioned with Keboola and Luigi earlier? Well, I've masked their charts with the respective “culprits”.

Now, onto the bars. Each trio of bars shows the count of negative, neutral, and positive comments on Hacker News articles that have at least one comment and were returned when searching for a specific orchestration tool.

What's the big reveal? It seems like Hacker News readers tend to spread more positivity than negativity, though neutral comments hold their ground.

And, as is often the case with utilizing LLMs, this data should be taken with a grain of salt. It's more of a whimsical exploration than a rigorous analysis. However, if you take a peek behind Kool Aid and Luigi, it's intriguing to note that articles related to them seem to attract a disproportionate amount of negativity. 😂


IF YOU'RE STILL HERE

… and you're just dipping your toes into the world of data orchestration, don’t sweat it. It's totally normal if it doesn't immediately click for you. For beginners, it can be tricky to grasp because in small projects, there isn't always that immediate need for things to happen "automatically" - you build your pipeline, run it once, and then bask in the satisfaction of your results - just like I did in my project. However, if you start playing around with one of these tools now, it could make it much easier to work with them later on. So, don't hesitate to dive in and experiment!

… And hey, if you're a seasoned pro about to drop some knowledge bombs, feel free to go for it - because what doesn’t challenge us, doesn’t change us 🥹. (*Cries in Gen Z*)

· 3 min read
Adrian Brudaru

About Yummy.eu

Yummy is a lean-ops meal-kit company that streamlines the entire food preparation process for customers in emerging markets by providing personalized recipes, nutritional guidance, and even shopping services. Their innovative approach ensures a hassle-free, nutritionally optimized meal experience, making daily cooking convenient and enjoyable.

Yummy is a food box business. At the intersection of gastronomy and logistics, this market is very competitive. To make it in this market, Yummy needs to be fast and informed in their operations.

Pipelines are not yet a commodity.

At Yummy, efficiency and timeliness are paramount. Initially, Martin, Yummy’s CTO, chose to purchase data pipelining tools for their operational and analytical needs, aiming to maximize time efficiency. However, the real-world performance of these purchased solutions did not meet expectations, which led to a reassessment of their approach.

What’s important: Velocity, Reliability, Speed, time. Money is secondary.

Martin was initially satisfied with the ease of setup provided by the SaaS services.

The tipping point came when an update to Yummy’s database introduced a new log table, leading to unexpectedly high fees due to the vendor’s default settings that automatically replicated new tables fully on every refresh. This situation highlighted the need for greater control over data management processes and prompted a shift towards more transparent and cost-effective solutions.

10x faster, 182x cheaper with dlt + async + Modal

Motivated to find a solution that balanced cost with performance, Martin explored using dlt, a tool known for its simplicity in building data pipelines. By combining dlt with asynchronous operations and using Modal for managed execution, the improvements were substantial:

  • Data processing speed increased tenfold.
  • Cost reduced by 182 times compared to the traditional SaaS tool.
  • The new system supports extracting data once and writing to multiple destinations without additional costs.

For a peek into how Martin implemented this solution, please see Martin's async Postgres source on GitHub.

salo-martin-tweet

Taking back control with open source has never been easier

Taking control of your data stack is more accessible than ever with the broad array of open-source tools available. SQL copy pipelines, often seen as a basic utility in data management, do not generally differ significantly between platforms. They perform similar transformations and schema management, making them a commodity available at minimal cost.

SQL to SQL copy pipelines are widespread, yet many service providers charge exorbitant fees for these simple tasks. In contrast, these pipelines can often be set up and run at a fraction of the cost—sometimes just the price of a few coffees.

At dltHub, we advocate for leveraging straightforward, freely available resources to regain control over your data processes and budget effectively.

Setting up a SQL pipeline can take just a few minutes with the right tools. Explore these resources to enhance your data operations:

For additional support or to connect with fellow data professionals, join our community.

· 4 min read
Adrian Brudaru

Statistical Data and Metadata eXchange (SDMX) is an international standard used extensively by global organizations, government agencies, and financial institutions to facilitate the efficient exchange, sharing, and processing of statistical data.

Utilizing SDMX enables seamless integration and access to a broad spectrum of statistical datasets covering economics, finance, population demographics, health, and education, among others.

These capabilities make it invaluable for creating robust, data-driven solutions that rely on accurate and comprehensive data sources.

embeddable etl

Why SDMX?

SDMX not only standardizes data formats across disparate systems but also simplifies the access to data provided by institutions such as Eurostat, the ECB (European Central Bank), the IMF (International Monetary Fund), and many national statistics offices.

This standardization allows data engineers and scientists to focus more on analyzing data rather than spending time on data cleaning and preparation.

Installation and Basic Usage

To start integrating SDMX data sources into your Python applications, install the sdmx library using pip:

pip install sdmx1

Here's an example of how to fetch data from multiple SDMX sources, illustrating the diversity of data flows and the ease of access:

from sdmx_source import sdmx_source

source = sdmx_source([
    {"data_source": "ESTAT", "dataflow": "PRC_PPP_IND", "key": {"freq": "A", "na_item": "PLI_EU28", "ppp_cat": "A0101", "geo": ["EE", "FI"]}, "table_name": "food_price_index"},
    {"data_source": "ESTAT", "dataflow": "sts_inpr_m", "key": "M.PROD.B-D+C+D.CA.I15+I10.EE"},
    {"data_source": "ECB", "dataflow": "EXR", "key": {"FREQ": "A", "CURRENCY": "USD"}},
])
print(list(source))

This configuration retrieves data from:

  • Eurostat (ESTAT) for the Purchasing Power Parity (PPP) and Price Level Indices providing insights into economic factors across different regions.
  • Eurostat's short-term statistics (sts_inpr_m) on industrial production, which is crucial for economic analysis.
  • European Central Bank (ECB) for exchange rates, essential for financial and trade-related analyses.

Loading the data with dlt, leveraging best practices

After retrieving data using the sdmx library, the next challenge is effectively integrating this data into databases. The dlt library excels in this area by offering a robust solution for data loading that adheres to best practices in several key ways:

  • Automated schema management -> dlt infers types and evolves schema as needed. It automatically handles nested structures too. You can customise this behavior, or turn the schema into a data contract.
  • Declarative configuration -> You can easily switch between write dispositions (append/replace/merge) or destinations.
  • Scalability -> dlt is designed to handle large volumes of data efficiently, making it suitable for enterprise-level applications and high-volume data streams. This scalability ensures that as your data needs grow, your data processing pipeline can grow with them without requiring significant redesign or resource allocation.
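
Putting the two together, a minimal loading sketch could look like this; the pipeline and dataset names are illustrative, and sdmx_source is called exactly as in the snippet above:

import dlt
from sdmx_source import sdmx_source

# Load one of the dataflows from the example above into DuckDB
pipeline = dlt.pipeline(
    pipeline_name="sdmx",
    destination="duckdb",
    dataset_name="sdmx_data",
)
load_info = pipeline.run(sdmx_source([
    {"data_source": "ECB", "dataflow": "EXR", "key": {"FREQ": "A", "CURRENCY": "USD"}},
]))
print(load_info)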

Martin Salo, CTO at Yummy, a food logistics company, uses dlt to efficiently manage complex data flows from SDMX sources. By leveraging dlt, Martin ensures that his data pipelines are not only easy to build, robust and error-resistant but also optimized for performance and scalability.

View Martin Salo's implementation

Martin Salo's implementation of the sdmx_source package effectively simplifies the retrieval of statistical data from diverse SDMX data sources using the Python sdmx library. The design is user-friendly, allowing both simple and complex data queries, and integrates the results directly into pandas DataFrames for immediate analysis.

This implementation enhances data accessibility and prepares it for analytical applications, with built-in logging and error handling to improve reliability.

Conclusion

Integrating sdmx and dlt into your data pipelines significantly enhances data management practices, ensuring operations are robust, scalable, and efficient. These tools provide essential capabilities for data professionals looking to seamlessly integrate complex statistical data into their workflows, enabling more effective data-driven decision-making.

By engaging with the data engineering community and sharing strategies and insights on effective data integration, data engineers can continue to refine their practices and achieve better outcomes in their projects.

Join the conversation and share your insights in our Slack community.

· 8 min read
Adrian Brudaru

embeddable etl

The versatility that enables "one way to rule them all"... requires a devtool

A unified approach to ETL processes centers around standardization without compromising flexibility. To achieve this, we need to be able to build and run custom code, but also have helpers that enable us to standardise and simplify our work.

In the data space, we have a few custom code options, some of which are portable. But what is needed to achieve universality and portability is more than just a code standard.

So what do we expect from such a tool?

  • It should be created for our developers.
  • It should be easily pluggable into existing tools and workflows.
  • It should perform across a variety of hardware and environments.

Data teams don't speak Object Oriented Programming (OOP)

Connectors are nice, but when they don't exist or break, what do we do? We need to be able to build and maintain those connectors simply, just as we work with the rest of our scripts.

The data person has a very mixed spectrum of activities and responsibilities, and programming is often a minor one. Thus, across a data team, while some members can read or even speak OOP, the team will not be able to do so without sacrificing other capabilities.

This means that in order to cater to a data team as a dev team, we need to acknowledge that a different abstraction is needed.

Goodbye OOP, hello @decorators!

Data teams often navigate complex systems and workflows that prioritize functional clarity over object-oriented programming (OOP) principles. They require tools that simplify process definition, enabling quick, readable, and maintainable data transformation and movement. Decorators serve this purpose well, providing a straightforward way to extend functionality without the overhead of class hierarchies and inheritance.

Decorators in Python allow data teams to annotate functions with metadata and operational characteristics, effectively wrapping additional behavior around core logic. This approach aligns with the procedural mindset commonly found in data workflows, where the emphasis is on the transformation steps and data flow rather than the objects that encapsulate them.

By leveraging decorators, data engineers can focus on defining what each part of the ETL process does—extract, transform, load—without delving into the complexities of OOP. This simplification makes the code more accessible to professionals who may not be OOP experts but are deeply involved in the practicalities of data handling and analysis.
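
To make this concrete, here is a minimal sketch of the decorator style in dlt; the endpoint URL is a made-up placeholder:

import dlt
import requests

# The decorator wraps schema inference, normalization, and loading behavior
# around core logic - no class hierarchies or inheritance required.
@dlt.resource(name="example_items", write_disposition="append")
def example_items():
    # Core logic stays a plain, readable generator (the URL is illustrative)
    response = requests.get("https://api.example.com/items")
    response.raise_for_status()
    yield from response.json()

pipeline = dlt.pipeline(
    pipeline_name="decorator_demo", destination="duckdb", dataset_name="items"
)
pipeline.run(example_items())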

The ability to run embedded is more than just scalability

Most traditional ETL frameworks are architected with the assumption of relatively abundant computational resources. This makes sense given the resource-intensive nature of ETL tasks when dealing with massive datasets.

However, this assumption often overlooks the potential for running these processes on smaller, more constrained infrastructures, such as directly embedded within an orchestrator or on edge devices.

The perspective that ETL processes necessarily require large-scale infrastructure is ripe for challenge. In fact, there is a compelling argument to be made for the efficiency and simplicity of executing ETL tasks, particularly web requests for data integration, on smaller systems. This approach can offer significant cost savings and agility, especially when dealing with less intensive data loads or when seeking to maintain a smaller digital footprint.

Small infrastructure ETL runs can be particularly efficient in situations where real-time data processing is not required, or where data volumes are modest. By utilizing the orchestrator's inherent scheduling and management capabilities, one can execute ETL jobs in a leaner, more cost-effective manner. This can be an excellent fit for organizations that have variable data processing needs, where the infrastructure can scale down to match lower demands, thereby avoiding the costs associated with maintaining larger, underutilized systems.

Running on small workers is easier than spinning up infra

Running ETL processes directly on an orchestrator can simplify architecture by reducing the number of moving parts and dependencies. It allows data teams to quickly integrate new data sources and destinations with minimal overhead. This methodology promotes a more agile and responsive data architecture, enabling businesses to adapt more swiftly to changing data requirements.

It's important to recognize that this lean approach won't be suitable for all scenarios, particularly where data volumes are large or where the complexity of transformations requires the robust computational capabilities of larger systems. Nevertheless, for a significant subset of ETL tasks, particularly those involving straightforward data integrations via web requests, running on smaller infrastructures presents an appealing alternative that is both cost-effective and simplifies the overall data processing landscape.

Dealing with spiky loads is easier on highly parallel infras like serverless functions

Serverless functions are particularly adept at managing spiky data loads due to their highly parallel and elastic nature. These platforms automatically scale up to handle bursts of data requests and scale down immediately after processing, ensuring that resources are utilized only when necessary. This dynamic scaling not only improves resource efficiency but also reduces costs, as billing is based on actual usage rather than reserved capacity.

The stateless design of serverless functions allows them to process multiple, independent tasks concurrently. This capability is crucial for handling simultaneous data streams during peak times, facilitating rapid data processing that aligns with sudden increases in load. Each function operates in isolation, mitigating the risk of one process impacting another, which enhances overall system reliability and performance.

Moreover, serverless architectures eliminate the need for ongoing server management and capacity planning. Data engineers can focus solely on the development of ETL logic without concerning themselves with underlying infrastructure issues. This shift away from operational overhead to pure development accelerates deployment cycles and fosters innovation.

Some examples of embedded portability with dlt

Dagster's embedded ETL now supports dlt - enabling devs to do what they love - build.

The "Stop Reinventing Orchestration: Embedded ELT in the Orchestrator" blog post by Pedram from Dagster Labs, introduces the concept of Embedded ELT within an orchestration framework, highlighting the transition in data engineering from bulky, complex systems towards more streamlined, embedded solutions that simplify data ingestion and management. This evolution is seen in the move away from heavy tools like Airbyte or Meltano towards utilizing lightweight, performant libraries which integrate seamlessly into existing orchestration platforms, reducing deployment complexity and operational overhead. This approach leverages the inherent capabilities of orchestration systems to handle concerns typical to data ingestion, such as state management, error handling, and observability, thereby enhancing efficiency and developer experience.

dlt was built for just such a scenario and we are happy to be adopted into it. Besides adding connectors, dlt adds a simple way to build custom pipelines.

Read more about it on Dagster blog post on dlt.

Dagworks' dlt + duckdb + ibis + Hamilton demo

The DAGWorks Substack post introduces a highly portable pipeline of all libraries, and leverages a blend of open-source Python libraries: dlt, Ibis, and Hamilton. This integration exemplifies the trend towards modular, decentralized data systems, where each component specializes in a segment of the data handling process—dlt for extraction and loading, Ibis for transformation, and Hamilton for orchestrating complex data flows. These technologies are not just tools but represent a paradigm shift in data engineering, promoting agility, scalability, and cost-efficiency in deploying serverless microservices.

The post not only highlights the technical prowess of combining these libraries to solve practical problems like message retention and thread summarization on Slack but also delves into the meta aspects of such integrations. It reflects on the broader implications of adopting a lightweight stack that can operate within diverse infrastructures, from cloud environments to embedded systems, underscoring the shift towards interoperability and backend agnosticism in data engineering practices. This approach illustrates a shift in the data landscape, moving from monolithic systems to flexible, adaptive solutions that can meet specific organizational needs without heavy dependencies or extensive infrastructure.

Read more about it on Dagworks blog post on dlt.

Closing thoughts

The concepts discussed here—portability, simplicity, and scalability—are central to modern data engineering practices. They reflect a shift towards tools that not only perform well but also integrate seamlessly across different environments, from high-powered servers to minimal infrastructures like edge devices. This shift emphasizes the importance of adaptability in tools used by data teams, catering to a broad spectrum of deployment scenarios without sacrificing performance.

In this landscape, dlt exemplifies the type of tool that embodies these principles. It's not just about being another platform; it's about providing a framework that supports the diverse needs of developers and engineers. dlt's design allows it to be embedded directly within various architectures, enabling teams to implement robust data processes with minimal overhead. This approach reduces complexity and fosters an environment where innovation is not hindered by the constraints of traditional data platforms.

We invite the community to engage with these concepts through dlt, contributing to its evolution and refinement. By participating in this collaborative effort, you can help ensure that the tool remains at the forefront of data engineering technology, providing effective solutions that address the real-world challenges of data management and integration.

Join the conversation and share your insights in our Slack community or contribute directly to the growing list of projects using us. Your expertise can drive the continuous improvement of dlt, shaping it into a tool that not only meets current demands but also anticipates future needs in the data engineering field.

· 7 min read
Adrian Brudaru

I spent the last 5 years before working on dlt as a data engineering freelancer. Before freelancing, I was working for "sexy but poor" startups, where building fast and cheap was a religion.

In this time, I had the pleasure of doing many first time setups, and a few "rebuilds" or "second time setups".

In fact, my first freelancing project was a "disaster recovery" one.

A "second time build" or "disaster recovery project" refers to the process of re-designing, re-building, or significantly overhauling a data warehouse or data infrastructure after the initial setup has failed to meet the organization's needs.

dipping your toes in disaster

The first time builds gone wrong

There's usually no need for a second time build, if the first time build works. Rather, a migration might cut it. A second time build usually happens only if

  • the first time build does not work, either now or for the next requirements.
  • the first time build cannot be "migrated" or "fixed" due to fundamental flaws.

Let's take some examples from my experience. Example 1: A serial talker takes a lead role at a large, growing startup. They speak like management, so management trusts them. A few years later:

  • Half the pipelines are running on Pentaho + Windows, the others are Python 2 and 3, written by agencies.
  • The data engineering team quit. They had enough.
  • The remaining data engineers do what they want - a custom framework - or they threaten to quit, taking the only knowledge of the pipelines with them.
  • Solution: Re-write all pipelines in Python 3, replace the custom framework with Airflow, add tests, GitHub, and other best practices.

Example 2: A large international manufacturing company needed a data warehouse.

  • Microsoft sold them their tech+ consultants.
  • 2 years later, it's done but doesn't work (query times are impossible).
  • Solution: Teach the in-house DE team to use Redshift and migrate.

Example 3: A non technical professional takes a lead data role and uses a tool to do everything.

  • same as above but the person also hired a team of juniors
  • since there was no sudden ragequit, the situation persisted for a few years
  • after they left, the remaining team removed the tool and re-built.

Example 4: A first time data hire introduces a platform-like tool that's sql centric and has no versioning, api, or programmatic control.

  • After writing 30k+ lines of wet SQL, scheduling the reports and making them dependent on each other in this UI tool (without lineage), the person can no longer maintain them.
  • Quits after arguing with management.
  • Solution: Reverse engineer the existing reports, account for bugs and unfulfilled requirements, and build them from scratch, occasionally searching the mass of SQL. Outcome was under 2k lines.

Example 5: A VC company wants to make a tool that reads metrics from business apps like Google Ads, Stripe.

  • They end up at the largest local agency, who recommends a single-tenant SaaS MDS for 90k to set up and a pathway from there.
  • They agreed and then asked me to review. The agency person was aggressive and questioned my knowledge on unrelated things, in an attempt to dismiss my assessment.
  • Turns out the agency was selling "installing 5tran and cleaning the data" for 5k+ per source, plus some implementation partners' time.
  • The VC later hired a non-technical freelancer to do the work.

Who can build a first time setup that scales into the future?

The non-negotiable skills needed are

  • Programming. You can use ETL tools for ingestion, but they rarely solve the problem fully (under 20% in my respondent network - these are generally <30 people companies)
  • Modelling. Architecture first, SQL second, tools third.
  • Requirement collection. You should consult your stakeholders on the data available to represent their process, and work with them to reach a good result. Usually the stakeholders are not experts and will not be able to give good requirements on their own.

Who's to blame and what can we do about it?

I believe the blame is quite shared. The common denominators seem to be

  • a lack of technical knowledge,
  • tools that promise to fill that gap,
  • and a warped or dishonest self-representation (by vendor or professional).

As for what to do about it: If you were a hiring manager, ensure that your first data hire has all the skills at their disposal, and make sure they don't just talk the talk but walk the walk. Ask for references or test them.

But you aren't a hiring manager (those folks don't read this blog).

So here's what you can do

  • Ensure all 3 skills are available - they do not need to all be in one person. You could hire a freelance DE to build first, and a technical analyst to fulfil requests and extend the stack.
  • Let vendors write about first data hire, and "follow the money" - Check if the advice aligns with their financial incentive. If it does, get a second opinion.
  • Choose tooling that scales across different stages of a data stack lifecycle, so the problem doesn't occur.
  • Use vendor-agnostic components where possible (for example, dlt + SQLMesh + SQLGlot can create a db-agnostic stack that enables you to switch between dbs).
  • Behave better - the temptation to oversell yourself is there, but you could check yourself and look for a position where you can learn. Your professional network could be your biggest help in your career, don't screw them over.
  • Use independent freelancers for consulting. They live off reputation, so look for the recommended ones.

How to do a disaster recovery?

The problem usually originates from the lack of a skill, which downstreams into implementations that don't scale. However, the solution is often not as simple as adding the skill, because various workarounds were created to bridge that gap, and those workarounds have people working on them.

Simply adding that missing skill to the team to build the missing piece would create a redundancy, which in its resolution would kick out the existing workarounds. But workarounds are maintained by roles, so the original implementer will usually feel their position threatened; this can easily escalate into a people conflict, which often ends with the workaround maker quitting (or getting fired).

How to manage the emotions?

  • Be considerate of people's feelings - you are brought in to replace their work, so make it a cooperative experience where they can be the hero.
  • Ask for help when you are not sure about who has the decision over an area.

How to manage the technical side?

  • Ensure you have all the skills needed to deliver a data stack on the team.
  • If the existing solution produces correct results, use it as requirements for the next - for example, you could write tests that check that business rules are correctly implemented.
  • Clarify with stakeholders how much the old solution should be maintained - it will likely free up people to work on the new one.
  • Identify team skills that can help towards the new solution and consider them when choosing the technology stack.

What I wish I knew

Each "disaster recovery" project was more than just a technical reboot; it was a test to the team's adaptability and to their the humility to recognize and rectify mistakes. What I wish I knew is that building a data infrastructure is as much about building a culture of continuous learning and improvement as it is about the code and systems themselves, and that they need to be fixed together - otherwise, one will break the other.

Want to discuss?

Agencies and freelancers are often the heavy-lifters that are brought in to do such setups. Is this something you are currently doing? Tell us about your challenges, so we may better support you.

Join our slack community to take part in the conversation.

· 12 min read
Adrian Brudaru

shift-left-data-democracy

Definitions of how I use the terms:

Data Governance: A system of oversight and guidance over the data, much like a government is a system of oversight and guidance for a country. The opposite of governance is anarchy, chaos, and entropy.

Data Democracy: A type of governance that ensures stakeholders are part of the governance.

Shift left: Assuming data flows from left to right, shift left represents a focus towards the origin.

Data Mesh: A decentralized data management strategy that treats data as a product, with domain-specific teams managing its quality, governance, and lifecycle.

Shift Left Data Democracy: From Access to Involvement

In the traditional view, data democracy was largely about democratizing access—ensuring that everyone across the organization could easily retrieve and analyze data. This was a crucial step forward, breaking down silos and making information more available than ever before. However, as we've evolved, so too has our understanding of what true data democracy entails.

Shift left data democracy represents a more profound change. It's not just about making data accessible post-factum; it's about involving a broader spectrum of roles in the very processes of data ingestion, processing, and management. This approach extends the principles of democracy to the entire data lifecycle, beginning with data ingestion.

It's a shift from mere consumption to participation, where access is just the beginning.

Data mesh is the driver

Just as the data mesh concept emerged to address the complexities of managing data in a distributed, domain-oriented environment, we now see a need for technology to evolve in parallel. The goal? To put data sources directly in the hands of the people who use them. This means giving teams the tools and autonomy to manage and process their data, ensuring governance and quality from the outset and throughout the data's lifecycle.

This shift left approach to data democracy aligns with the idea behind data mesh, recognizing that effective data management and governance are not centralized activities but distributed responsibilities. By involving more stakeholders from the very start of the data flows, we're not just democratizing access; we're democratizing the entire data flows.

Governance, from a power game to a team sport: a brief history of how we got here

Building a data warehouse is a beaten path - but how to go from technical solution to organisation-wide application?

Building a data warehouse for reporting on some business processes is a good start, but in order to leverage that data we need a culture to do so and the skills to do it correctly.

While a centralised solution enables a skilled team to deliver results, these results are often inflexible without hands-on help - so how can the organisation be expected to become data-driven? The process of tracking a goal, creating hypotheses, starting an experiment and then tracking outcomes is much more complex than that of tracking a metric in a dashboard.

Cue, the move to democratic data access.

From Monarchy to Democracy: Data access for the people!

The move from a centralised system to a democratic system comes from the competitive use of data. In a centralised system where only management has access, data is used to keep track of goals. To enable people to use that data to do something about those goals, they need both access to the data and an understanding of it.

As with anything, the first step is obvious: Give people access - without it, there is no progress. However, once we do that, the reality rears its ugly head: Access is not enough!

Democratic access is great, but as long as the data producers are not providing clean, documented data, we don't have a democracy. Instead, what we have is reality-adjusted communism - we all have plentiful access to the same black box or garbage that the big central team put in.

monarchy-to-democracy

So, after democratizing data access, the next challenge was to answer the obvious question: So what does this data mean?

Turns out, the central team doesn't quite know either - it's rather the owner of the process we track, the data producer, that understands how the data they emit links to the real life process it tracks.

So how do we go from having data to understanding what it means?

From democratizing access to democratizing literacy through embedded analysts

One easy way to help teams understand the data is to give them an analyst resource. And what better than someone who knows their domain?

Cue the embedded analysts. These folks are crucial in bridging the gap between data capabilities and domain-specific needs. By positioning data experts within specific business units, organizations can ensure that the insights generated are highly relevant and immediately applicable to the domain's unique challenges and opportunities.

democracy-to-embedded

This placement helps in several key ways:

  • Domain expertise meets data munging: Embedded analysts develop a deep understanding of the specific challenges and workflows of the business unit they are part of, which enables them to tailor data models and analytics strategies effectively.
  • Data literacy: These analysts act as champions of data within their teams, educating and training non-data savvy members on data-driven decision-making processes. This upskills the team and increases the overall data literacy within the unit.
  • Faster response times: Being close to the operational realities of the business unit, embedded analysts can deliver faster, more targeted responses to data queries and needs, reducing the time from question to insight.

And just as we started, we solve another increment of the problem, which reveals the next.

Now that we can analyse the data, we need the data. But, it turns out the data we have is dirty, and we are missing some outright.

So let's solve the next problem: Data sources and quality.

The Advent of Data Mesh: Solving the data source problem

Wow, we went quite a way to get here. A decade after we started talking about democratization, we are starting to recognize that governance is an activity, not a process. And democracy is more work than we originally thought.

embedded-to-mesh

The data mesh architecture marks a significant evolution in the data democratization journey. Data mesh advances the principles of embedded analysts by decentralizing data ownership entirely, promoting domain-specific control over data assets.

This architectural approach is based on the idea that data should not only be accessible but also actionable across various sections of an organization without bottlenecks.

And just like governments hire a lot of people, turns out, a governance system also needs people to work for it.

Data mesh tries to solve much of that by embracing domain-oriented decentralization. In this model, data is treated as a product with the domain teams as the product owners. These teams are responsible for ensuring their data's quality and relevance, significantly reducing the latency issues found in centralized systems by eliminating the lengthy processes of data cleansing and approval.

Further, data mesh empowers teams with the necessary tools and authority to manage their data effectively, fostering a culture where data is a valuable asset across all levels of the organization. This approach not only supports rapid decision-making and innovation within teams but also offers scalability and flexibility as organizational data needs evolve, allowing domains to independently expand their data operations without a comprehensive overhaul of the central data infrastructure.

Of course, at this point having a complete or partial data platform that offers some governance becomes very important: we don't want individual business units to be burdened with responsibility but without proper tooling - or the outcome will be high entropy.

From retrofitting governance to applying it from the start: Shift left data democracy!

mesh-to-sldd

Imagine a world where your company's data sources can just be picked and unpacked in the destination of your choice by analysts - not through an external SaaS tool, but via an internal service portal.

Shift-Left Data Democracy (SLDD) is a concept in data management that advocates for integrating data governance early in the data lifecycle. This approach shifts governance practices from being a retrospective or reactionary activity to an integral part of the initial design and development phases of data systems. By doing so, SLDD aims to embed governance, quality controls, and compliance measures at the point of data creation and throughout its subsequent handling.

By embedding governance early in the data lifecycle, SLDD eliminates the complex and costly process of retrofitting governance frameworks to mature datasets and systems. This proactive approach leads to streamlined operations, reducing both the complexity and the cost traditionally associated with late-stage governance implementation.

This early incorporation of governance enhances transparency throughout the entire process. Stakeholders gain a clear understanding of how data is managed and governed from the start, building trust and ensuring compliance.

What's revolutionary about SLDD is that a governed data source can easily be unfolded into a normalised or analytical model.

This "ad hoc data mart" can be used without central bottlenecks and easily customised to fit specific cases without having to reach modelling consensus with other teams. This built-in modularity avoids the creation of more bottlenecks downstream, enabling fast research and development where needed.

Further, a well-defined governance framework enables greater innovation within safe boundaries. Teams can explore and innovate knowing they are aligned with compliance and operational standards, which speeds up experimentation and development cycles. This environment encourages a more dynamic approach to data handling, where creativity is not stifled by fear of violating governance protocols. By treating governance as an integral part of the data management process rather than a hindrance, SLDD fosters a culture where data truly drives innovation.

Distinction between data mesh and shift-left data democracy

While both concepts advocate for decentralized governance, they focus on different aspects of the data lifecycle. Data mesh architecture emphasizes the structural and operational decentralization of data management, granting autonomy to domain-specific teams. Shift-left data democracy, on the other hand, extends this decentralization to the very beginning of the data lifecycle, advocating for early involvement and broad stakeholder participation in governance processes.

The main difference is that mesh is applied post-factum. For newly built systems, starting with governance as a technical, universal standard is less complex. And while mesh grants autonomy, the resulting entropy raises complexity and cost; formalising and standardising responsibilities from the start of data production, on the other hand, reduces entropy.

Practicing shift-left data democracy

So how do we do it? Is this a future or can we already do it?

We asked ourselves the same and we are working towards fully supporting the standard.

Ensuring quality at the source

Start with having quality control embedded in the source. Here's what I mean: start with a clear schema for your data, and ensure you have a strategy to adapt to change. One such strategy could be having data contracts, refusing any data that does not fit the defined schema. The other strategy would be evolving the schema into a staging layer and notifying about changes, so the engineering analyst can look into the data to understand what happened and correctly deal with the change.

At dlt we support schema evolution and data contracts. docs.
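
To make this concrete, here is a minimal sketch of the contract-based strategy using dlt's schema_contract option; the resource name, sample data and chosen contract modes are illustrative assumptions, so check the docs for the exact settings available:

import dlt

@dlt.resource(
    name="user_events",  # hypothetical resource name
    # assumed modes: new tables may evolve, loads fail if unknown columns appear,
    # and values whose type does not match the schema are discarded
    schema_contract={"tables": "evolve", "columns": "freeze", "data_type": "discard_value"},
)
def user_events():
    yield {"user_id": 1, "event": "signup", "created_at": "2022-01-01T00:00:00Z"}

pipeline = dlt.pipeline("governed_ingestion", destination="duckdb")
pipeline.run(user_events())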

Metadata for full lineage

Column- and row-level lineage are a basic necessity for development and traceability, so ensure each ingested package is annotated with source and time. Keep track of when columns are added to a source. Associate those schema changes with the corresponding load package to achieve column- and row-level lineage already from the ingestion layer, referring to a source defined as pipeline code, not a table.

Besides data lineage, you want semantic metadata. What does a source actually represent as a business entity or process? To govern data semantically, we would need semantic tags at the source. This would enable us to know how to work with the data. For example, we could generate data vault, 3NF, star schema or activity schema models algorithmically, starting from annotated JSON documents.

Besides business entities, domains or processes, semantic tags could also designate PII, security policies, or anything actionable. For example, PII tags could enable automatic lineage documentation and governance, while access tags could enable automatic access policies or automatic data modelling.

dlt currently supports column and row level lineage, as well as schema comments - which could be used as annotations.
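
As a sketch of how such annotations could look, column hints can carry descriptions that double as lightweight semantic tags; the column names and the PII convention below are assumptions for illustration:

import dlt

@dlt.resource(
    name="customers",
    columns={
        # assumed: the description hint is used as a free-form annotation, e.g. to flag PII
        "email": {"data_type": "text", "description": "PII: customer email address"},
        "signup_date": {"data_type": "timestamp", "description": "business process: acquisition"},
    },
)
def customers():
    yield {"email": "jane@example.com", "signup_date": "2022-01-01T00:00:00Z"}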

The role of the Data platform engineer will grow

In a shift left data democracy, the data platform engineer is a key character, much like a CTO is in an organisation. By having a data platform engineer, you ensure your data governance is done with automated tooling that supports implementation and compliance.

The data platform engineer becomes pivotal in empowering the democratization of data, providing the essential tooling and infrastructure that allow teams across the organization to manage their data autonomously.

Data platform engineers become enablers and facilitators, embedding governance and quality controls right from the start of the data lifecycle. Their work supports the organization by ensuring that data management practices are not only compliant and secure but also accessible and intuitive for non-specialists (democratic). This shift underlines a transition from centralized control to distributed empowerment, where data platform engineers support the broader goal of making data accessible, manageable, and governable across the entire spectrum of the organization.

The future of data management

history_to_future

Are we heading towards semantically annotated data marts as code? Why not? We're in the age of serverless infrastructures, after all. Could data sociocracy become the future? Would we eventually encourage the entire organisation to annotate data sources with their learnings? Only time will tell.

Want to discuss?

Join the dlt slack community to take part in the conversation.

· 8 min read
Adrian Brudaru

The concept of simplicity and automation in a programming language is not new. The Perl scripting language had the motto "Perl makes easy things easy and hard things possible".

The reason for this motto was the difficulty of working with C, which requires more manual handling of resources as well as a compilation step.

Perl scripts could be written and executed rapidly, making it ideal for tasks that needed quick development cycles. This ease of use and ability to handle complex tasks without cumbersome syntax made Perl incredibly popular in its heyday.

Perl was introduced as a scripting language that emphasized getting things done. It was created as a practical extraction and reporting tool, which quickly found its place in system administration, web development, and network programming.

History repeats, Python is a language for humans

human-building

Python took the philosophy of making programming more accessible and human-friendly even further. Guido van Rossum created Python with the goal of removing the drudgery from coding, choosing to prioritize readability and simplicity. This design philosophy makes Python an intuitive language not just for seasoned developers but for beginners as well. Its syntax is clean and expressive, allowing developers to write fewer lines of code for tasks that would require more in Perl or other languages. Python's extensive standard library, along with its powerful data structures, contribute to its ability to handle complex applications with ease.

Python's widespread adoption across various domains, from web development to data science and machine learning, is largely attributed to its accessibility.

Its simple syntax resembles natural language, which lowers the barrier to entry for programming. Compared to Perl, Python offers an even more organized and readable approach to coding, making it an ideal teaching language that prepares new developers for future challenges in software development.

And just like Perl, it's used for data extraction and visualisation - but now it's done by normie humans, not sysadmins or devs.

dlt makes easy things fast, and hard things accessible

Following the principles of Perl and Python, dlt aimed to simplify the data engineering process. dlt focuses on making the extraction and loading of data as straightforward as possible.

dlt makes easy things fast

Starting from a simple abstraction like pipeline.run(data, table_name="table"), where data can be any iterable such as a generator or dataframe, dlt enables robust loading. Here is what the above function does, so you don't have to.

  • It will (optionally) unpack nested lists into separate tables with generated join keys, and flatten nested dictionaries into a main row.
  • If given a generator, it will consume it via microbatching, buffering to disk or external drives, never running out of memory (customisable).
  • It will create "extract packages" of extracted data, so if the downstream steps fail, it can resume/retry later.
  • It will normalise the data into a shape that naturally fits the database (customisable).
  • It will create "load packages" of normalised data, so if the downstream steps fail, it can retry later.
  • It infers and loads with the correct data types, for example from ISO timestamp strings (configurable).
  • It can accept different write dispositions declaratively, such as 'append', 'merge' and 'replace'.
  • It will evolve the schema if we load something with new columns a second time, and it can alert on schema changes.
  • It will even create type variant columns if data types change (and alert if desired).
  • Or you can stop the schema from evolving and use the inferred schema, or a modified one, as a data contract.
  • It will report load packages associated with new columns, enabling column-level lineage to be passed downstream.
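
A minimal sketch of that abstraction in action; the destination, dataset name and sample data are illustrative:

import dlt

# nested data: the "orders" list will be unpacked into a child table with generated join keys
rows = [
    {"id": 1, "name": "Alice", "orders": [{"sku": "A-1", "qty": 2}, {"sku": "B-3", "qty": 1}]},
    {"id": 2, "name": "Bob", "orders": []},
]

pipeline = dlt.pipeline("quickstart", destination="duckdb", dataset_name="shop")
# types are inferred, nested structures are normalised, and the schema evolves on later runs
info = pipeline.run(rows, table_name="customers", write_disposition="append")
print(info)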

That's a lot of development and maintenance pain solved, and that's only at its simplest. You could say the dlt loader doesn't break as long as it encounters common data types. If an obscure type is in your data, it would need to be added to dlt or converted beforehand.

From robust loading to robust extraction

Building on the simple loading abstraction, dlt is more than a tool for simple things.

The next step in dlt usage is to leverage it for extraction. dlt offers the concepts of 'source' and 'resource'. A resource is the equivalent of a single data source, while a source is the group we put resources in to bundle them for usage.

For example, an extractor for a single API with multiple endpoints would be built as a source with multiple resources.

Resources enable you to easily configure how the data in that resource is loaded. You can create a resource by decorating a function with the '@dlt.resource' decorator, or you can generate resources dynamically.

Examples of dynamic resources

  • If we have an API with multiple endpoints, we can put the endpoints in a list and iterate over it to generate resources (see the sketch after this list)
  • If we have an endpoint that gives us datapoints with different schemas, we could split them by a column in the data.
  • Similarly, if we have a webhook that listens to multiple types of events, it can dispatch each event type to its own table based on the data.
  • Or, if we want to shard a data stream into day-shards, we could append a date suffix in the resource name dynamically.
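
For instance, the first pattern could look like the following minimal sketch; the endpoint names and the fetch_endpoint helper are hypothetical:

import dlt
import requests

def fetch_endpoint(base_url, endpoint):
    # hypothetical helper: yields items from a paginated endpoint
    page = 1
    while True:
        data = requests.get(f"{base_url}/{endpoint}", params={"page": page}).json()
        if not data:
            break
        yield from data
        page += 1

@dlt.source
def my_api(base_url: str = dlt.config.value):
    # generate one resource per endpoint in the list
    for endpoint in ["customers", "orders", "invoices"]:
        yield dlt.resource(
            fetch_endpoint(base_url, endpoint),
            name=endpoint,
            write_disposition="append",
        )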

Once we group resources into a source, we can run them together (or, we could still run the resources independently)

Examples of reasons to group resources into sources.

  • We want to run (load) them together on the same schedule
  • We want to configure them together or keep their schemas together
  • They represent a single API and we want to publish them in a coherent, easy to use way.

So what are the efforts you spare when using dlt here?

  • A source can function similarly to a class, but simpler, encouraging code reuse and simplicity.
  • Resources offer more granular configuration options.
  • Resources can also be transformers, passing data between them in a microbatched way, enabling patterns like enrichments or list/detail endpoints (see the sketch after this list).
  • Source schemas can be configured with various options, such as pushing down top-level columns into nested structures.
  • dlt's requests replacement has built-in retries for non-permanent error codes. This safeguards the progress of long extraction jobs that could otherwise break over and over (if retried as a whole) due to network or source API issues.
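
As an example of the transformer pattern mentioned above, here is a minimal list/detail sketch; the resource names and the enrichment step are hypothetical:

import dlt

@dlt.resource
def users():
    # "list" endpoint: yields user ids
    yield from [{"id": 1}, {"id": 2}]

@dlt.transformer(data_from=users)
def user_details(user):
    # "detail" step: enrich each item; dlt passes data between the two in microbatches
    yield {"user_id": user["id"], "detail": f"details for user {user['id']}"}

pipeline = dlt.pipeline("list_detail_demo", destination="duckdb")
pipeline.run(user_details)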

What else does dlt bring to the table?

Beyond the ease of data extraction and loading, dlt introduces several advanced features that further simplify data engineering tasks:

Asynchronous operations: dlt harnesses the power of asynchronous programming to manage I/O-bound and network operations efficiently. This means faster data processing, better resource utilization, and more responsive applications, especially when dealing with high volumes of data or remote data sources.

Flexible destinations and reverse ETL: dlt isn't just about pulling data in; it's about sending it where it needs to go. Whether it's a SQL database, a data lake, a cloud-based storage solution, or a custom reverse ETL destination, dlt provides the flexibility to integrate with various destinations.

Optional T in ETL: With dlt, transformations are not an afterthought but a core feature. You can define transformations as part of your data pipelines, ensuring that the data is not just moved but refined, enriched, and shaped to fit your analytical needs. This capability allows for more sophisticated data modeling and preparation tasks to be streamlined within your ELT processes.
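
As a hedged sketch of what such an inline transformation could look like, assuming dlt's add_map hook on resources (the cleanup function itself is illustrative):

import dlt

@dlt.resource(name="users")
def users():
    yield {"id": 1, "email": " Alice@Example.com "}

# attach a lightweight per-row transformation before loading
users_clean = users().add_map(lambda row: {**row, "email": row["email"].strip().lower()})

pipeline = dlt.pipeline("cleanup_demo", destination="duckdb")
pipeline.run(users_clean)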

Data quality and observability: dlt places a strong emphasis on data quality and observability. It includes features for schema evolution tracking, data type validation, error handling, and data contracts, which are critical for maintaining the integrity of your data ecosystem. Observability tools integrated within dlt help monitor the health and performance of your pipelines, providing insights into data flows, bottlenecks, and potential issues before they escalate.

Community and ecosystem: One of the most significant advantages of dlt is its growing community and ecosystem. Similar to Python, dlt benefits from contributions that extend its capabilities, including connectors, plugins, and integrations. This collaborative environment ensures that dlt remains at the forefront of data engineering innovation, adapting to new challenges and opportunities.

In essence, dlt is not just a tool but a comprehensive one-stop shop that addresses the end-to-end needs of modern data ingestion. By combining the simplicity of Python with the robustness of enterprise-grade tools, dlt democratizes data engineering, making it accessible to a broader audience. Whether you're a data scientist, analyst, or engineer, dlt empowers you to focus on what matters most: deriving insights and value from your data.

Conclusion

As Perl and Python have made programming more accessible, dlt is set to transform data engineering by making sophisticated data operations accessible to all. This marks a significant shift towards the democratization of technology, enabling more individuals to contribute to and benefit from the digital landscape. dlt isn't just about making easy things fast and hard things accessible; it's about preparing a future where data engineering becomes an integral part of every data professional's toolkit.

· 8 min read
Adrian Brudaru

Why Python is the right approach for doing Reverse ETL

Reverse ETL is generally about putting data into a business application. This data would often come from a SQL database used as a middle layer for data integrations and calculations.

That’s fine - but nowadays most data people speak Python, and the types of things we want to put into an operational application don’t always come from a DB; they often come from other business applications, or from things like a dataframe on which we did some scoring, etc.

reverse etl

The full potential of Reverse ETL is in the flexibility of sources

SQL databases are a good start, but in reality very often our data source is something else. More often than not, it’s a Python analyst’s implementation of some scoring or some business calculation.

Other times, it’s a business application - for example, we might have a form that sends the response data to a webhook, from where it could end up in Salesforce, DWH, and Slack as a notification. And of course, if this is done by a data person it will be done in Python.

Thus, it follows that if we want to cater to the data crowd, we need to be Pythonic.

There’s synergy with ETL

Reverse ETL is ultimately ETL. Data is extracted from a source, it’s transformed, and then loaded to a destination. The challenges are similar; the most notable difference is that pulling data from a strongly typed environment like a DB and converting it to weakly typed JSON is MUCH easier than the other way around. You could argue that Reverse ETL is simpler than ETL.

Flavors of Reverse ETL

Just like we have ETL and ELT, we also have flavors of Reverse ETL:

  • Reverse ETL or TEL: Transform the data to a specification, read it from DB, and send it to an application.
  • Tool Reverse ETL or ETL: Extract from DB, map fields to destination in the tool, load to destination.
  • Pythonic Freestyle Reverse ETL: You extract data from wherever you want and put it anywhere except storage/DB. Transformations are optional.

Examples of Python reverse ETL

  • Read data from Mongo, do anomaly detection, and notify anomalies to Slack.
  • Read membership data from Stripe, calculate the chance to churn, and upload to CRM for account managers.
  • Capture a form response with a webhook and send the information to CRM, DWH, and Slack.

Add Python - new skills unlocked!

So why is it much better to do reverse ETL in Python?

  • Live Streaming and Flexibility: Python's ability to handle live data streams and integrate with various APIs and services surpasses the capabilities of SQL-based data warehouses designed for batch processing.
  • End-to-End Workflow: Employing Python from data extraction to operational integration facilitates a streamlined workflow, enabling data teams to maintain consistency and efficiency across the pipeline.
  • Customization and Scalability: Python's versatility allows for tailored solutions that can scale with minimal overhead, reducing the reliance on additional tools and simplifying maintenance.
  • Collaboration and Governance: By keeping the entire data workflow within Python, teams can ensure better governance, compliance, and collaboration, leveraging common tools and repositories.

Example: Building a Custom Destination and a pipeline in under 1h

Documentation used:

  • Building a destination: docs
  • SQL source: docs

In this example, you will see why it’s faster to build a custom destination than to set up a separate tool.

dlt allows you to define custom destination functions. You'll write a function that extracts the relevant data from your dataframe and formats it for the Google Sheets API.

This example assumes you have set up Google Sheets API access and obtained the necessary credentials to authenticate.

Step 1: Setting Up Google Sheets API (10min)

  1. Enable the Google Sheets API in the Google Developers Console.
  2. Download the credentials JSON file.
  3. Share the target Google Sheet with the email address found in your credentials JSON file.

Step 2: Define the Destination method in its own file sheets_destination.py (20min)

Install the required package for the Google API client:

pip install --upgrade google-api-python-client google-auth-httplib2 google-auth-oauthlib

Here’s how to define a destination function to update a Google Sheet. In our case, we wrote a slightly more complex function that checks the headers and aligns the columns with the existing ones before inserting:

import dlt
from google.oauth2.service_account import Credentials
from googleapiclient.discovery import build


@dlt.destination(batch_size=100)
def google_sheets(items,
                  table_schema,
                  sheets_id: str = dlt.config.value,
                  credentials_json: dict = dlt.secrets.value,
                  range_name: str = 'Sheet1'):
    """
    Send data to a Google Sheet.
    :param items: Batch of items to send.
    :param table_schema: Schema of the table (unused in this example but required by dlt).
    :param sheets_id: ID of the Google Sheet, retrieved from config.
    :param credentials_json: Google Service Account credentials, retrieved from secrets.
    :param range_name: The specific range within the Sheet where data should be appended.
    """
    credentials = Credentials.from_service_account_info(credentials_json)
    service = build('sheets', 'v4', credentials=credentials)

    # Fetch existing headers from the sheet
    existing_headers_result = service.spreadsheets().values().get(
        spreadsheetId=sheets_id, range="Sheet1!A1:1"
    ).execute()
    existing_headers = existing_headers_result.get('values', [[]])[0] if existing_headers_result.get('values') else []

    # Determine new headers from items
    new_keys = set().union(*(d.keys() for d in items))
    # Identify headers that need to be added (not already existing)
    headers_to_add = [key for key in new_keys if key not in existing_headers]
    # New comprehensive headers list, preserving the order of existing headers and adding new ones at the end
    comprehensive_headers = existing_headers + headers_to_add

    # If there are headers to add, update the first row with comprehensive headers
    if headers_to_add:
        update_body = {'values': [comprehensive_headers]}
        service.spreadsheets().values().update(
            spreadsheetId=sheets_id, range="Sheet1!A1",
            valueInputOption='RAW', body=update_body
        ).execute()

    # Prepare the data rows according to the comprehensive headers list
    values = []
    for item in items:
        row = [item.get(header, "") for header in comprehensive_headers]  # Fill missing keys with empty string
        values.append(row)

    # Append the data rows
    if values:
        append_body = {'values': values}
        append_result = service.spreadsheets().values().append(
            spreadsheetId=sheets_id, range=range_name,
            valueInputOption='RAW', insertDataOption='INSERT_ROWS', body=append_body
        ).execute()
        print(f"{append_result.get('updates').get('updatedRows')} rows have been added to the sheet.")


Step 3: Configure secrets (5min)

For the custom destination, you can follow this example. Configure the source as instructed in the source documentation.

secrets.toml

[destination.google_sheets]
credentials_json = '''
{
"type": "service_account",
"project_id": "your_project_id",
"private_key_id": "your_private_key_id",
...
}
'''

config.toml

sheets_id = "1xj6APSKhepp8-sJIucbD9DDx7eyBt4UI2KlAYaQ9EKs"

Step 4: Running the pipeline in sheets_destination.py (10min)

Now, assuming you have a source function dict_row_generator(), you can set up and run your pipeline as follows:

# ... destination code from above

# pass some destination arguments explicitly (`range_name`)
pipeline = dlt.pipeline("my_google_sheets_pipeline", destination=google_sheets(range_name="named_range"))

# A simple source function yielding rows with varying keys
def dict_row_generator():
    yield {"row": 1, 'a': "a"}
    yield {"row": 2, 'b': "b"}
    yield {"row": 3, 'c': "c"}
    yield {"row": 1, 'a': 1}
    yield {"row": 2, 'b': 2}
    yield {"row": 3, 'c': 3}



# Now, run the pipeline with the specified source
info = pipeline.run(dict_row_generator)

In this setup, google_sheets acts as a custom destination within your dlt pipeline, pushing the generated data to the specified Google Sheet. This method enables streamlined and secure data operations, fully utilizing Python's capabilities for Reverse ETL processes into Google Sheets.

What does dlt do for me here?

Using dlt for reverse ETL instead of plain Python, especially with its @dlt.destination decorator, provides a structured framework that streamlines the process of integrating data into various destinations. Here’s how the dlt decorator specifically helps, compared to crafting everything from scratch in plain Python:

Faster time to Production grade pipelines

The @dlt.destination decorator significantly reduces the need for custom boilerplate code. It provides a structured approach to manage batch processing, error handling, and retries, which would otherwise require complex custom implementations in plain Python. This built-in functionality ensures reliability and resilience in your data pipelines.

Focus on custom business logic and adding value

The flexibility of creating custom destinations with dlt shifts the focus from the possibilities to the necessities of your specific use case. This empowers you to concentrate on implementing the best solutions for your unique business requirements.

Scalability and efficient resource use

dlt facilitates efficient handling of large data loads through chunking and batching, allowing for optimal use of computing resources. This means even small worker machines can stream data effectively into your chosen endpoint, instead of a large machine sitting idle waiting for the network. The library's design supports easy scaling and adjustments: changing batch sizes or configurations is straightforward, ensuring your data pipelines can grow and evolve with minimal effort. This approach simplifies maintenance and ensures that once a solution is implemented, it's broadly applicable across your projects.

In Conclusion

Reverse ETL is just a piece of the ETL puzzle. It can be done cleaner and better when done in Python end to end.

Tools will always appeal to non-technical folks. However, anyone with the ability to build Python pipelines can build Reverse ETL pipelines too, bringing the typical benefits of code over tools to a dev team - customisation, collaboration, best practices, etc.

So read more about how to build a dlt destination and consider giving it a try in your next reverse ETL pipeline.

