Skip to main content

2 posts tagged with "python elt"

View All Tags

· 9 min read
Adrian Brudaru

What is the REST API Source toolkit?

tip

tl;dr: You are probably familiar with REST APIs.

  • Our new REST API Source is a short, declarative configuration driven way of creating sources.
  • Our new REST API Client is a collection of Python helpers used by the above source, which you can also use as a standalone, config-free, imperative high-level abstraction for building pipelines.

Want to skip to docs? Links at the bottom of the post.

Why REST configuration pipeline? Obviously, we need one!

But of course! Why repeat write all this code for requests and loading, when we could write it once and re-use it with different APIs with different configs?

Once you have built a few pipelines from REST APIs, you can recognise we could, instead of writing code, write configuration.

We can call such an obvious next step in ETL tools a “focal point” of “convergent evolution”.

And if you’ve been in a few larger more mature companies, you will have seen a variety of home-grown solutions that look similar. You might also have seen such solutions as commercial products or offerings.

But ours will be better…

So far we have seen many REST API configurators and products — they suffer from predictable flaws:

  • Local homebrewed flavors are local for a reason: They aren’t suitable for the broad audience. And often if you ask the users/beneficiaries of these frameworks, they will sometimes argue that they aren’t suitable for anyone at all.
  • Commercial products are yet another data product that doesn’t plug into your stack, brings black boxes and removes autonomy, so they simply aren’t an acceptable solution in many cases.

So how can dlt do better?

Because it can keep the best of both worlds: the autonomy of a library, the quality of a commercial product.

As you will see further, we created not just a standalone “configuration-based source builder” but we also expose the REST API client used enabling its use directly in code.

Hey community, you made us do it!

The push for this is coming from you, the community. While we had considered the concept before, there were many things dlt needed before creating a new way to build pipelines. A declarative extractor after all, would not make dlt easier to adopt, because a declarative approach requires more upfront knowledge.

Credits:

  • So, thank you Alex Butler for building a first version of this and donating it to us back in August ‘23: https://github.com/dlt-hub/dlt-init-openapi/pull/2.
  • And thank you Francesco Mucio and Willi Müller for re-opening the topic, and creating video tutorials.
  • And last but not least, thank you to dlt team’s Anton Burnashev (also known for gspread library) for building it out!

The outcome? Two Python-only interfaces, one declarative, one imperative.

  • dlt’s REST API Source is a Python dictionary-first declarative source builder, that has enhanced flexibility, supports callable passes, native config validations via python dictionaries, and composability directly in your scripts. It enables generating sources dynamically during runtime, enabling straightforward, manual or automated workflows for adapting sources to changes.
  • dlt’s REST API Client is the low-level abstraction that powers the REST API Source. You can use it in your imperative code for more automation and brevity, if you do not wish to use the higher level declarative interface.

Useful for those who frequently build new pipelines

If you are on a team with 2-3 pipelines that never change much you likely won’t see much benefit from our latest tool. What we observe from early feedback a declarative extractor is great at is enabling easier work at scale. We heard excitement about the REST API Source from:

  • companies with many pipelines that frequently create new pipelines,
  • data platform teams,
  • freelancers and agencies,
  • folks who want to generate pipelines with LLMs and need a simple interface.

How to use the REST API Source?

Since this is a declarative interface, we can’t make things up as we go along, and instead need to understand what we want to do upfront and declare that.

In some cases, we might not have the information upfront, so we will show you how to get that info during your development workflow.

Depending on how you learn better, you can either watch the videos that our community members made, or follow the walkthrough below.

Video walkthroughs

In these videos, you will learn at a leisurely pace how to use the new interface. Playlist link.

Workflow walkthrough: Step by step

If you prefer to do things at your own pace, try the workflow walkthrough, which will show you the workflow of using the declarative interface.

In the example below, we will show how to create an API integration with 2 endpoints. One of these is a child resource, using the data from the parent endpoint to make a new request.

Configuration Checklist: Before getting started

In the following, we will use the GitHub API as an example.

We will also provide links to examples from this Google Colab tutorial.

  1. Collect your api url and endpoints, Colab example:

    • An URL is the base of the request, for example: https://api.github.com/.
    • An endpoint is the path of an individual resource such as:
      • /repos/{OWNER}/{REPO}/issues;
      • or /repos/{OWNER}/{REPO}/issues/{issue_number}/comments which would require the issue number from the above endpoint;
      • or /users/{username}/starred etc.
  2. Identify the authentication methods, Colab example:

  3. Identify if you have any dependent request patterns such as first get ids in a list, then use id for requesting details. For GitHub, we might do the below or any other dependent requests. Colab example.:

    1. Get all repos of an org https://api.github.com/orgs/{org}/repos.
    2. Then get all contributors https://api.github.com/repos/{owner}/{repo}/contributors.
  4. How does pagination work? Is there any? Do we know the exact pattern? Colab example.

    • On GitHub, we have consistent pagination between endpoints that looks like this link_header = response.headers.get('Link', None).
  5. Identify the necessary information for incremental loading, Colab example:

    • Will any endpoints be loaded incrementally?
    • What columns will you use for incremental extraction and loading?
    • GitHub example: We can extract new issues by requesting issues after a particular time: https://api.github.com/repos/{repo_owner}/{repo_name}/issues?since={since}.

Configuration Checklist: Checking responses during development

  1. Data path:
  2. Unless you had full documentation at point 4 (which we did), you likely need to still figure out some details on how pagination works.
    1. To do that, we suggest using curl or a second python script to do a request and inspect the response. This gives you flexibility to try anything. Colab example.
    2. Or you could print the source as above - but if there is metadata in headers etc, you might miss it.

Applying the configuration

Here’s what a configured example could look like:

  1. Base URL and endpoints.
  2. Authentication.
  3. Pagination.
  4. Incremental configuration.
  5. Dependent resource (child) configuration.

If you are using a narrow screen, scroll the snippet below to look for the numbers designating each component (n).

# This source has 2 resources:
# - issues: Parent resource, retrieves issues incl. issue number
# - issues_comments: Child resource which needs the issue number from parent.

import os
from rest_api import RESTAPIConfig

github_config: RESTAPIConfig = {
"client": {
"base_url": "https://api.github.com/repos/dlt-hub/dlt/", #(1)
# Optional auth for improving rate limits
# "auth": { #(2)
# "token": os.environ.get('GITHUB_TOKEN'),
# },
},
# The paginator is autodetected, but we can pass it explicitly #(3)
# "paginator": {
# "type": "header_link",
# "next_url_path": "paging.link",
# }
# We can declare generic settings in one place
# Our data is stateful so we load it incrementally by merging on id
"resource_defaults": {
"primary_key": "id", #(4)
"write_disposition": "merge", #(4)
# these are request params specific to GitHub
"endpoint": {
"params": {
"per_page": 10,
},
},
},
"resources": [
# This is the first resource - issues
{
"name": "issues",
"endpoint": {
"path": "issues", #(1)
"params": {
"sort": "updated",
"direction": "desc",
"state": "open",
"since": {
"type": "incremental", #(4)
"cursor_path": "updated_at", #(4)
"initial_value": "2024-01-25T11:21:28Z", #(4)
},
}
},
},
# Configuration for fetching comments on issues #(5)
# This is a child resource - as in, it needs something from another
{
"name": "issue_comments",
"endpoint": {
"path": "issues/{issue_number}/comments", #(1)
# For child resources, you can use values from the parent resource for params.
"params": {
"issue_number": {
# Use type "resolve" to define child endpoint wich should be resolved
"type": "resolve",
# Parent endpoint
"resource": "issues",
# The specific field in the issues resource to use for resolution
"field": "number",
}
},
},
# A list of fields, from the parent resource, which will be included in the child resource output.
"include_from_parent": ["id"],
},
],
}

And that’s a wrap — what else should you know?

  • As we mentioned, there’s also a REST Client - an imperative way to use the same abstractions, for example, the auto-paginator - check out this runnable snippet:

    from dlt.sources.helpers.rest_client import RESTClient

    # Initialize the RESTClient with the Pokémon API base URL
    client = RESTClient(base_url="https://pokeapi.co/api/v2")

    # Using the paginate method to automatically handle pagination
    for page in client.paginate("/pokemon"):
    print(page)
  • We are going to generate a bunch of sources from OpenAPI specs — stay tuned for an update in a couple of weeks!

Next steps

· 3 min read
Adrian Brudaru

About Yummy.eu

Yummy is a Lean-ops meal-kit company streamlines the entire food preparation process for customers in emerging markets by providing personalized recipes, nutritional guidance, and even shopping services. Their innovative approach ensures a hassle-free, nutritionally optimized meal experience, making daily cooking convenient and enjoyable.

Yummy is a food box business. At the intersection of gastronomy and logistics, this market is very competitive. To make it in this market, Yummy needs to be fast and informed in their operations.

Pipelines are not yet a commodity.

At Yummy, efficiency and timeliness are paramount. Initially, Martin, Yummy’s CTO, chose to purchase data pipelining tools for their operational and analytical needs, aiming to maximize time efficiency. However, the real-world performance of these purchased solutions did not meet expectations, which led to a reassessment of their approach.

What’s important: Velocity, Reliability, Speed, time. Money is secondary.

Martin was initially satisfied with the ease of setup provided by the SaaS services.

The tipping point came when an update to Yummy’s database introduced a new log table, leading to unexpectedly high fees due to the vendor’s default settings that automatically replicated new tables fully on every refresh. This situation highlighted the need for greater control over data management processes and prompted a shift towards more transparent and cost-effective solutions.

10x faster, 182x cheaper with dlt + async + modal

Motivated to find a solution that balanced cost with performance, Martin explored using dlt, a tool known for its simplicity in building data pipelines. By combining dlt with asynchronous operations and using Modal for managed execution, the improvements were substantial:

  • Data processing speed increased tenfold.
  • Cost reduced by 182 times compared to the traditional SaaS tool.
  • The new system supports extracting data once and writing to multiple destinations without additional costs.

For a peek into on how Martin implemented this solution, please see Martin's async Postgres source on GitHub..

salo-martin-tweet

Taking back control with open source has never been easier

Taking control of your data stack is more accessible than ever with the broad array of open-source tools available. SQL copy pipelines, often seen as a basic utility in data management, do not generally differ significantly between platforms. They perform similar transformations and schema management, making them a commodity available at minimal cost.

SQL to SQL copy pipelines are widespread, yet many service providers charge exorbitant fees for these simple tasks. In contrast, these pipelines can often be set up and run at a fraction of the cost—sometimes just the price of a few coffees.

At dltHub, we advocate for leveraging straightforward, freely available resources to regain control over your data processes and budget effectively.

Setting up a SQL pipeline can take just a few minutes with the right tools. Explore these resources to enhance your data operations:

For additional support or to connect with fellow data professionals, join our community.

This demo works on codespaces. Codespaces is a development environment available for free to anyone with a Github account. You'll be asked to fork the demo repository and from there the README guides you with further steps.
The demo uses the Continue VSCode extension.

Off to codespaces!

DHelp

Ask a question

Welcome to "Codex Central", your next-gen help center, driven by OpenAI's GPT-4 model. It's more than just a forum or a FAQ hub – it's a dynamic knowledge base where coders can find AI-assisted solutions to their pressing problems. With GPT-4's powerful comprehension and predictive abilities, Codex Central provides instantaneous issue resolution, insightful debugging, and personalized guidance. Get your code running smoothly with the unparalleled support at Codex Central - coding help reimagined with AI prowess.