dlt & OpenAPI code generation: A step beyond APIs and towards 10,000s of live datasets

  • Matthaus Krzykowski,
    Co-Founder & CEO

Today we are releasing a proof of concept of the dlt init extension that can generate dlt pipelines from an OpenAPI specification.

If you build APIs, for example with FastAPI, the OpenAPI spec lets you automatically generate a Python client and hand it to your users. Our demo takes this a step further and enables you to generate advanced dlt pipelines that, in essence, convert your API into a live dataset.
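For context, here is a minimal sketch of where that spec comes from, assuming a FastAPI service (the endpoint and model below are invented for illustration): FastAPI derives the OpenAPI spec from the route signatures and serves it at /openapi.json, and that document is exactly what a client generator - or our pipeline generator - consumes.

```python
# Minimal, illustrative FastAPI app - the endpoint and model are invented for this example.
# FastAPI derives the OpenAPI spec from the route signatures automatically and serves it
# at /openapi.json; that document is what client (and pipeline) generators read.
from fastapi import FastAPI
from pydantic import BaseModel


class Pokemon(BaseModel):
    id: int
    name: str


app = FastAPI(title="Pokemon API")


@app.get("/pokemons", response_model=list[Pokemon])
def list_pokemons() -> list[Pokemon]:
    # A real API would query a database here.
    return [Pokemon(id=1, name="bulbasaur"), Pokemon(id=2, name="ivysaur")]
```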

You can watch Marcin generate such a pipeline from the Pokemon API's OpenAPI spec in the embedded Loom demo.

Part of our vision is that each API will come with a dlt pipeline - similar to how many APIs today come with a Python client. We believe that API users often do not really want to deal with endpoints, HTTP requests, and JSON responses. They need live, evolving datasets that they can place anywhere they want, so that the data is accessible to any workflow.

We believe that API builders will bundle dlt pipelines with their APIs only if the process is hassle-free. One answer to that is code generation and the reuse of information from the OpenAPI spec.

This release is part of a bigger vision for dlt: a world centered around accessible data for modern data teams. In these new times, code is becoming more disposable, but data stays valuable. We eventually want to create an ecosystem where hundreds of thousands of pipelines are created, shared, and deployed, and where datasets, reports, and analytics can be written and shared publicly and privately. Code generation is automation on steroids, and we are going to release many more features based on this principle.

Generating a pipeline for PokeAPI using the OpenAPI spec

In the embedded Loom you saw Marcin pull data from the dlt pipeline created from the OpenAPI spec. The proof of concept already uses a few tricks and heuristics to generate useful code - and contrary to what you may think, PokeAPI is a complex API with a lot of linked data types and endpoints! Here is what the generator did (a simplified sketch of the resulting pipeline follows the list):

  • It created a resource for each endpoint that returns a list of objects.
  • It used heuristics to discover and extract lists wrapped in responses.
  • It generated dlt transformers from all endpoints that have a matching list resource (and return the same object type).
  • It used heuristics to find the right object ID to pass to the transformer.
  • It allowed Marcin to select endpoints using the questionary lib in the CLI.
  • It listed the endpoints with the most central data types at the top (think of tables that refer to several other tables).
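
To make the above concrete, here is a simplified, hand-written sketch of the kind of pipeline the generator aims to produce for PokeAPI - a list resource plus a transformer that follows each item's URL. It is not the actual generated output; names and structure are illustrative.

```python
# Hand-written sketch of the kind of code the generator produces - simplified,
# not the actual generated output.
import dlt
from dlt.sources.helpers import requests


@dlt.resource(write_disposition="replace")
def pokemon_list():
    # List endpoint: the heuristics detect that the list is wrapped in "results".
    yield requests.get("https://pokeapi.co/api/v2/pokemon?limit=50").json()["results"]


@dlt.transformer(data_from=pokemon_list)
def pokemon_details(pokemons):
    # Detail endpoint: for each list item, fetch the full object via its URL.
    for p in pokemons:
        yield requests.get(p["url"]).json()


if __name__ == "__main__":
    pipeline = dlt.pipeline(
        pipeline_name="pokemon", destination="duckdb", dataset_name="pokemon_data"
    )
    load_info = pipeline.run([pokemon_list, pokemon_details])
    print(load_info)
```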

As mentioned, the PoC was tested thoroughly against PokeAPI. We know it also works with many other APIs - we just can't guarantee that our tricks hold in every case, since they have not been tested extensively.

Anyway: Try it out yourself!

We plan to take this even further!

  • We will move this feature into dlt init and integrate with LLM code generation!
  • Restructuring of the Python client: We will fully restructure the underlying Python client, compressing all the files in the pokemon/api folder into a single, nice, and extendable client.
  • GPT-4 friendly: We'll allow easy addition of pagination and other injections into the client.
  • More heuristics: Many more heuristics to extract resources and their dependencies, and to infer incremental and merge loading.
  • Tight integration with FastAPI on the code level to get even more heuristics!

Your feedback and help are greatly appreciated. Join our community, and let's build together.