Unified Data Access with dlt, from local to cloud
- Adrian Brudaru, Co-Founder & CDO
A universal interface for your data
What if there was one universal interface to simplify it all?
That’s exactly what dlt does with its dataset interface. You can treat your data the same way, no matter where it actually lives or what format it’s in.
A small example
- Let’s assume we have some sample data coming from a dlt source, HubSpot. We load it to the local filesystem for some experiments.
import dlt
from hubspot import hubspot

local_pipeline = dlt.pipeline(
    pipeline_name="hubspot_pipeline",
    destination="filesystem",
    dataset_name="hubspot_dataset"
)
local_pipeline.run(hubspot())
- Example access: we can access this data with SQL or Python, and even join tables.
# Step 1: Access the dataset from the pipeline
dataset = local_pipeline.dataset()

# Step 2: Access a table as a ReadableRelation
deals_table = dataset.deals  # Or dataset["deals"]

# Step 3: Fetch the entire table as a Pandas DataFrame
df = deals_table.df()

# Alternatively, fetch as a PyArrow Table
arrow_table = deals_table.arrow()
- Create a transformed relation with SQL or Python
transformed_relation = dataset("""SELECT *
    FROM hubspot_dataset.deals d
    JOIN hubspot_dataset.pipelines_deals pd ON d.pipeline = pd.id;""")
- Save the result to an online destination (say, BigQuery)
# Configure a pipeline to BigQuery
prod_pipeline = dlt.pipeline(
    pipeline_name="bigquery_demo",
    destination="bigquery",
    dataset_name="transformed_dataset"
)

# Stream the transformed relation to BigQuery in Arrow chunks
prod_pipeline.run(transformed_relation.iter_arrow(chunk_size=10000),
                  table_name="hubspot_deals_data")
What’s the significance?
We built dataset access for downstream code portability in dlt+. Here’s what it enables:
Simpler, higher-quality development with Local-Online Parity
Local-online parity means my code runs locally the same way it runs online. This has a couple of implications:
First, I don’t need access to the online production system to develop my next pipeline. I could write all my code locally and then deploy it to production to run on that infra, with nothing changing.
Second, this means I do not need a special online test environment. I can just test locally, which lets me experiment quickly without needing special infrastructure or test setups. This also keeps the online environment clean of experimental datasets. Oh, and it’s faster than testing online, as I don’t have to wait for the network when loading data.
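As a minimal sketch of that parity, the snippet below switches the destination with a single, hypothetical RUNTIME_ENV environment variable; everything downstream of the pipeline stays identical.

import os
import dlt
from hubspot import hubspot

# Hypothetical toggle: RUNTIME_ENV=local during development, RUNTIME_ENV=prod once deployed
destination = "filesystem" if os.getenv("RUNTIME_ENV", "local") == "local" else "bigquery"

pipeline = dlt.pipeline(
    pipeline_name="hubspot_pipeline",
    destination=destination,
    dataset_name="hubspot_dataset"
)
pipeline.run(hubspot())

# Downstream access is identical in both environments
deals = pipeline.dataset().deals.df()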
Tech agnosticism across data stacks enables data mesh and universal pipelines
Large org governance: If you’re a large organisation, chances are you have multiple data teams using multiple solutions. Datasets enable running the same code on any infrastructure, adding a level of collaboration and portability between teams using different stacks.
In dlt+, which builds around datasets, we also enable the technical implementation of data mesh by adding decentralised access points via local catalogs.
Service provider templates: If you’re an agency, the lack of consistency across data stacks limits your offering and forces you to develop bespoke solutions every time, inefficiently. Having one codebase that runs on any tech stack significantly increases your reachable market and lets you sell outcomes instead of time, for better margins and less repetitive work.
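To make that concrete, here is a minimal sketch of such a template: a single, hypothetical function parameterised only by destination, which could be deployed to clients on Snowflake, BigQuery, or anything else dlt supports.

import dlt
from hubspot import hubspot

def run_hubspot_template(destination: str, dataset_name: str = "hubspot_dataset"):
    # The same pipeline, deployed onto whatever stack a client already runs
    pipeline = dlt.pipeline(
        pipeline_name="hubspot_pipeline",
        destination=destination,
        dataset_name=dataset_name
    )
    return pipeline.run(hubspot())

# Client A on Snowflake, client B on BigQuery - identical code, different config
run_hubspot_template("snowflake")
run_hubspot_template("bigquery")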
Effective data testing with portable compute
Testing data before loading it to an online destination can have several benefits.
Write-Audit-Publish is a testing pattern that improves your data quality and prevents disasters by testing the data before committing it online; a minimal sketch follows the list below.
Why not test online? There are many reasons, like:
- compliance (we may not be allowed to upload non-compliant data and test it afterwards),
- efficiency through individual disposable test environments, which can also be shared for review,
- cost effectiveness: local is free, cloud is expensive. When data developers work online, they often run up huge costs, because cost management takes extra cognitive load and effort.
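Here is a minimal Write-Audit-Publish sketch using the dataset interface: write to local DuckDB, audit with a simple check (the amount-column rule is a hypothetical example), and publish to BigQuery only if the audit passes.

import dlt
from hubspot import hubspot

# Write: load into a local, disposable environment first
staging = dlt.pipeline(
    pipeline_name="hubspot_staging",
    destination="duckdb",
    dataset_name="hubspot_staging"
)
staging.run(hubspot())

# Audit: run checks against the local dataset (hypothetical rule: no negative deal amounts)
deals = staging.dataset().deals.df()
assert (deals["amount"] >= 0).all(), "audit failed: negative deal amounts"

# Publish: only audited data reaches the online destination
prod = dlt.pipeline(
    pipeline_name="hubspot_prod",
    destination="bigquery",
    dataset_name="hubspot_dataset"
)
prod.run(staging.dataset().deals.iter_arrow(chunk_size=10000), table_name="deals")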
Okta’s team learned this the hard way when skyrocketing warehouse bills and a tangled S3-to-warehouse pipeline forced them to do all transformations and testing locally first (on AWS Lambda). This switch cut costs and ensured higher data quality before anything hit production.
Vendor unlock, piecewise or complete
Multi-engine data stacks are all the rage. In the Okta example above, moving the pre-processing to a more lightweight approach (DuckDB in Lambda) made things “an order of magnitude” cheaper, down from a bill of $2,000/day. A small piece of processing ended up saving a lot.
We spoke to an anonymous user who runs their pipelines on GitHub Actions with dlt, DuckDB, and MotherDuck. This setup leverages the GitHub Actions free tier and MotherDuck credits, and replaces the complete cost of a data stack, from orchestration to compute, database, and CI/CD, with a bill of a few cents per day.
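The Python side of such a setup could look like the sketch below (pipeline names and the aggregation query are assumptions): raw data lands in local DuckDB inside the CI runner, gets reduced with SQL, and only the result is pushed to MotherDuck. GitHub Actions then simply runs this script on a schedule.

import dlt
from hubspot import hubspot

# Land raw data in local DuckDB inside the (free-tier) CI runner
local = dlt.pipeline(
    pipeline_name="hubspot_ci",
    destination="duckdb",
    dataset_name="hubspot_raw"
)
local.run(hubspot())

# Transform locally with SQL, then push only the result to MotherDuck
daily_deals = local.dataset()(
    "SELECT pipeline, COUNT(*) AS deal_count FROM hubspot_raw.deals GROUP BY pipeline"
)

cloud = dlt.pipeline(
    pipeline_name="hubspot_motherduck",
    destination="motherduck",
    dataset_name="hubspot_reporting"
)
cloud.run(daily_deals.iter_arrow(chunk_size=10000), table_name="daily_deals")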
Try technology faster, don’t get trapped.
Maybe you want to try Iceberg? How does it perform compared to Delta tables? Benchmark with your own dataset? Sure, you can do that in minutes without leaving your Python script. Ultimately, code portability is key to evaluating which solution suits you and migrating quickly. If you are using dlt’s datasets, you are using a universal interface that lets you later shift your code to a technical solution that might suit you better.
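A rough sketch of such a benchmark, assuming the delta and iceberg table-format extras are installed for the filesystem destination and that table_format can be passed at run time; the timing loop is purely illustrative.

import time
import dlt
from hubspot import hubspot

def benchmark(table_format: str) -> float:
    # Load the same source with a given open table format and time it
    pipeline = dlt.pipeline(
        pipeline_name=f"hubspot_{table_format}",
        destination="filesystem",
        dataset_name=f"hubspot_{table_format}"
    )
    start = time.monotonic()
    pipeline.run(hubspot(), table_format=table_format)
    return time.monotonic() - start

for fmt in ("delta", "iceberg"):
    print(fmt, round(benchmark(fmt), 1), "seconds")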
You might even use different tech for different solutions. In the end, if you are managing access via the datasets interface and using pay-as-you-go services, you aren’t really adding overhead.
Portability of code execution.
Another type of portability is that of the compute element. Perhaps today you have self-hosted ClickHouse, perhaps tomorrow you buy always-on ClickHouse Cloud, and next week you move to an on-demand hosted ClickHouse vendor. For example, one of our users moved their compute to tower.dev, which handles serverless execution without the usual orchestration hassles.
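In dlt terms, that move is a configuration change rather than a code change. A minimal sketch, assuming the clickhouse destination with credentials kept in secrets.toml or environment variables:

import dlt
from hubspot import hubspot

# Only [destination.clickhouse.credentials] changes when you move from
# self-hosted ClickHouse to ClickHouse Cloud or an on-demand vendor
pipeline = dlt.pipeline(
    pipeline_name="hubspot_clickhouse",
    destination="clickhouse",
    dataset_name="hubspot_dataset"
)
pipeline.run(hubspot())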
Pip installable data products
Finally, what does this mean for us?
Our data [science] platform in a box, dlthub, enables building portable data products. For this, we need a data platform that runs the same way on the developer’s box as it runs online. Because online services cannot be feasibly replicated locally, having an abstraction layer that enables runtime portability becomes key.
In dlt+, this enables:
- an optimised development workspace for building your data platform, complete with a declarative workspace, local-online parity, and all the necessary supporting features for datasets
- pip-installable datasets as portable data products, enabling decentralised data meshes
- a dbt runner for dataset access and a dbt model generator for quickly scaffolding a dbt project
- an easy way to build semantics-aware AI agents, thanks to additional metadata captured by dlt