
3 posts tagged with "DuckDB"


· 10 min read
Rahul Joshi

TL;DR: I combined dlt, dbt, DuckDB, MotherDuck, and Metabase to create a Modern Data Stack in a box that makes it very easy to create a data pipeline from scratch and then deploy it to production.

I started working at dltHub in March 2023, right around the time when we released DuckDB as a destination for dlt. As a Python user, being able to create a data pipeline, load the data on my laptop, and explore and query the data all in Python was awesome.

At the time I also came across this very cool article by Jacob Matson in which he talks about creating a Modern Data Stack (MDS) in a box with DuckDB. I was already fascinated with dlt and all the other new tools that I was discovering, so reading about this approach of combining different tools to execute an end-to-end proof of concept on your laptop was especially interesting.

Fast forward to a few weeks ago when dlt released MotherDuck as a destination. The first thing that I thought of was an approach to MDS in a box where you develop locally with DuckDB and deploy in the cloud with MotherDuck. I wanted to try it out.

What makes this awesome

In my example, I wanted to customize reports on top of Google Analytics 4 (GA4) and combine it with data from GitHub. This is usually challenging because, while exporting data from GA4 to BigQuery is simple, combining it with other sources and creating custom analytics on top of it can get pretty complicated.

By first pulling all the data from different sources into DuckDB files on my laptop, I was able to do my development and customization locally.

local-workflow

And then when I was ready to move to production, I was able to seamlessly switch from DuckDB to MotherDuck with almost no code re-writing!

production-workflow

Thus I got a super simple and highly customizable MDS in a box that is also close to a company production setting.

What does this MDS in a box version look like?

| Tool | Layer | Why it's awesome |
| --- | --- | --- |
| dlt | data ingestion | ridiculously easy to write a customized pipeline in Python to load from any source |
| DuckDB | data warehouse on your laptop | free, fast OLAP database on your local laptop that you can explore using SQL or Python |
| MotherDuck | data warehouse in the cloud | DuckDB, but in the cloud: a fast OLAP database that you can connect to your local DuckDB file and share with the team in company production settings |
| dbt | data transformation | an amazing open source tool to package your data transformations, and it also combines well with dlt, DuckDB, and MotherDuck |
| Metabase | reporting | open source, has support for DuckDB, and looks prettier than my Python notebook |

How this all works

The example that I chose was inspired by one of my existing workflows: that of understanding dlt-related metrics every month. Previously, I was using only Google BigQuery and Metabase to understand dlt’s product usage, but now I decided to test what a migration to DuckDB and MotherDuck would look like.

The idea is to build a dashboard to track metrics around how people are using and engaging with dlt on different platforms, like GitHub (contributions, activity, stars, etc.) and the dlt website and docs (number of website/docs visits, etc.).

This is a perfect problem to test out my new super simple and highly customizable MDS in a box because it involves combining data from different sources (GitHub API, Google Analytics 4) and tracking them in a live analytics dashboard.

  1. Loading the data using dlt

    The advantage of using dlt for data ingestion is that dlt makes it very easy to create and customize data pipelines using just Python.

    In this example, I created two data pipelines:

    • BigQuery → DuckDB: Since all the Google Analytics 4 data is stored in BigQuery, I needed a pipeline that could load all events data from BigQuery into a local DuckDB instance. BigQuery does not exist as a verified source for dlt, which means that I had to write this pipeline from scratch.
    • GitHub API → DuckDB:
      dlt has an existing GitHub source that loads data around reactions, PRs, comments, and issues. To also load data on stargazers, I had to modify the existing source.

    dlt is simple and highly customizable:

    • Even though BigQuery does not exist as a dlt source, dlt makes it simple to write a pipeline that uses BigQuery as a source. Here is how this looks:

      1. Create a dlt project:

        dlt init bigquery duckdb

        This creates a folder with the following directory structure:

        ├── .dlt
        │   ├── config.toml
        │   └── secrets.toml
        ├── bigquery.py
        └── requirements.txt
      2. Add BigQuery credentials inside .dlt/secrets.toml.

      3. Add a Python function inside bigquery.py that requests the data.

      4. Load the data by simply running python bigquery.py.

        See the accompanying repo for a detailed step-by-step on how this was done, and the sketch at the end of this step for a rough idea of what bigquery.py can look like.

    • The data in BigQuery is nested, which dlt automatically normalizes on loading.

      BigQuery might store data in nested structures which would need to be flattened before being loaded into the target database. This typically increases the challenge in writing data pipelines.

      dlt simplifies this process by automatically normalizing such nested data on load.

      nested-bigquery

      Example of what the nested data in BigQuery looks like.

      normalized-bigquery

      dlt loads the main data into table ga_events, and creates another table ga_events__event_params for the nested data.

    • The existing GitHub source does not load information on stargazers. dlt makes it easy to customize the GitHub source for this.

      The way the existing GitHub verified source is written, it only loads data on GitHub issues, reactions, comments, and pull requests. To configure it to also load data on stargazers, all I had to do was add a Python function for it in the pipeline script.

      See the accompanying repo for a detailed step-by-step on how this was done.
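    For a rough idea of what such a hand-written pipeline can look like, here is a minimal sketch of bigquery.py. It is not the exact code from the accompanying repo: the project, dataset, and query are hypothetical, and it assumes the BigQuery credentials stored in .dlt/secrets.toml are a service account info dict.

```python
import dlt
from google.cloud import bigquery


@dlt.resource(name="ga_events", write_disposition="append")
def ga_events(credentials=dlt.secrets.value):
    """Query the GA4 export in BigQuery and yield rows as dicts."""
    client = bigquery.Client.from_service_account_info(credentials)
    # Hypothetical project / dataset; the real query lives in the accompanying repo.
    query = "SELECT * FROM `my-project.analytics_123456.events_*`"
    for row in client.query(query):
        yield dict(row)


if __name__ == "__main__":
    pipeline = dlt.pipeline(
        pipeline_name="bigquery",
        destination="duckdb",   # later switched to "motherduck" for production
        dataset_name="ga_data",
    )
    print(pipeline.run(ga_events()))
```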

  2. Using DuckDB as the data warehouse
    DuckDB is open source, fast, and easy to use. It simplifies the process of validating the data after loading it with the data pipeline.

    In this example, after running the BigQuery pipeline, the data was loaded into a locally created DuckDB file called ‘bigquery.duckdb’, and this allowed me to use Python to explore the loaded data (a small example is sketched at the end of this step):

    duckdb-explore

    The best thing about using DuckDB is that it provides a local testing and development environment. This means that you can quickly test and validate your workflow, without any additional costs, before deploying it to production.

    Also, being open source, it benefits from community contributions, particularly the dbt-duckdb adapter and the DuckDB Metabase driver, which make it very useful in workflows like these.
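    A minimal sketch of this kind of exploration, assuming the dataset name ga_data used above and a GA4 column such as event_name:

```python
import duckdb

# Open the local file written by the dlt pipeline; "ga_data" is the dataset/schema it used.
con = duckdb.connect("bigquery.duckdb")

# List everything that was loaded.
print(con.sql("SHOW ALL TABLES").df())

# Quick sanity check on the events table.
print(
    con.sql(
        "SELECT event_name, count(*) AS occurrences "
        "FROM ga_data.ga_events GROUP BY 1 ORDER BY 2 DESC"
    ).df()
)
```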

  3. dbt for data transformations

    Because of dlt’s dbt runner and DuckDB’s dbt adapter, it was very easy to insert dbt into the existing workflow. What this looked like:

    1. I first installed dbt along with the DuckDB adapter using pip install dbt-duckdb.
    2. I then created a dbt project inside the dlt project using dbt init and added any transforms as usual.
    3. Finally, I added dlt’s dbt runner to my Python script, and this configured my pipeline to automatically transform the data after loading it (see the sketch below). See the documentation for more information on the dbt runner.
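    Here is a minimal sketch of how the dbt runner can be wired in, assuming the dbt project created in step 2 lives in a folder called dbt_transform (a name made up for this sketch):

```python
import dlt

pipeline = dlt.pipeline(
    pipeline_name="bigquery", destination="duckdb", dataset_name="ga_data"
)

# Point dlt's dbt runner at the dbt project and run all models against the pipeline's destination.
dbt = dlt.dbt.package(pipeline, "dbt_transform")
models = dbt.run_all()
for model in models:
    print(f"{model.model_name}: {model.status} - {model.message}")
```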
  4. Metabase for the dashboard

    Metabase OSS has a DuckDB driver, which meant that I could simply point it to the DuckDB files in my system and build a dashboard on top of this data.

    dashboard-1

    dashboard-2

    dashboard-3

    dashboard-4

  5. Going to production: Using MotherDuck as the destination

    So far the process had been simple. The integrations among dlt, dbt, DuckDB, and Metabase made the loading, transformation, and visualization of data fairly straightforward. But the data loaded into DuckDB existed only on my machine, and if I wanted to share this data with my team, then I needed to move it to a different storage location accessible by them.

    The best and the easiest way to do this was to use MotherDuck: a serverless cloud analytics platform built on top of DuckDB, where you can host your local DuckDB databases.

    Why choose MotherDuck

    1. Go from development to production with almost no code re-writing:

      This was my main reason for choosing MotherDuck. MotherDuck integrates with dlt, dbt, and Metabase just as well as DuckDB. And I was able to replace DuckDB with MotherDuck in my pipeline with almost no code re-writing!

      What this process looked like:

      1. First, I modified the dlt pipelines to load to MotherDuck instead of DuckDB as follows:
        1. I added credentials for MotherDuck inside .dlt/secrets.toml
        2. I made a minor update to the code: just by changing destination = "duckdb" to destination = "motherduck", the pipelines were already configured to load the data into MotherDuck instead of DuckDB (see the sketch below)
      2. With this change, I was already able to deploy my pipelines with GitHub Actions.
      3. After deploying, I simply changed the DuckDB path to the MotherDuck path in Metabase, and then I deployed Metabase on GCP.

      This is great because it greatly simplifies the development lifecycle. Using DuckDB + MotherDuck, you can develop and test your pipeline locally and then move seamlessly to production.
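      In code, the whole switch boils down to the destination name; a minimal sketch, assuming the MotherDuck service token sits in .dlt/secrets.toml:

```python
import dlt

# Development: load into a local DuckDB file
# pipeline = dlt.pipeline(pipeline_name="bigquery", destination="duckdb", dataset_name="ga_data")

# Production: the same pipeline, now loading into MotherDuck
pipeline = dlt.pipeline(
    pipeline_name="bigquery",
    destination="motherduck",
    dataset_name="ga_data",
)
```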

    2. Very easy to copy local DuckDB databases to MotherDuck

      This was especially useful in this demo. Google Analytics 4 events data is typically large and when fetching this data from BigQuery, you are billed for the requests.

      In this example, after I ran the BigQuery -> DuckDB pipeline during development, I wanted to avoid loading the same data again when deploying the pipeline. I was able to do this by copying the complete local DuckDB database to MotherDuck (see the sketch below) and configuring the pipeline to only load new data from BigQuery.
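      A rough sketch of such a copy from Python: the CREATE DATABASE ... FROM syntax follows MotherDuck's documentation at the time of writing, the cloud database name is made up, and the MotherDuck token is assumed to be available in the MOTHERDUCK_TOKEN environment variable.

```python
import duckdb

# Connect to MotherDuck (reads the token from the MOTHERDUCK_TOKEN env var).
con = duckdb.connect("md:")

# Upload the complete local database as a new cloud database.
con.sql("CREATE DATABASE bigquery_cloud FROM 'bigquery.duckdb'")
```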

    3. Easy to share and collaborate

      Being able to share data with the team was the main goal behind moving from DuckDB to a cloud data warehouse. MotherDuck provides a centralized storage system for the DuckDB databases which you can share with your team, allowing them to access this data from their own local DuckDB databases.

      In my example, after I load the data to MotherDuck, I can provide access to my team just by clicking on ‘Share’ in the menu of their web UI.

      motherduck-share

Conclusion:

This was a fun and interesting exercise of creating a simple, yet powerful Modern Data Stack in a box. For me the biggest positives about this approach are:

  1. Everything was happening on my laptop during development, giving me full control. Still, going to production was seamless, and I didn't need to change my code and data transformations at all.
  2. I really liked that I could come with my ideas on what data I need and just write the pipelines in Python using dlt. I was not forced to pick from a small pool of existing data extractors. Both customizing code contributed by others and writing my BigQuery source from scratch were fun and didn't go beyond the Python and data engineering knowledge that I had.
  3. I'm impressed by how simple and customizable my version of MDS is. dlt, DuckDB, and MotherDuck share a similar philosophy of giving full power to the local user and making it easy to interact with them in Python.

I repeat this entire process for the BigQuery pipeline in this video:

· 3 min read
Matthaus Krzykowski

Using DuckDB, dlt, & GitHub to explore DuckDB


TL;DR: We created a Colab notebook for you to learn more about DuckDB (or any open source repository of interest) using DuckDB, dlt, and the GitHub API 🙂

So is DuckDB full of data about ducks?

Nope, you can put whatever data you want into DuckDB ✨

Many data analysts, data scientists, and developers prefer to work with data on their laptops. DuckDB allows them to start quickly and easily. When working only locally becomes infeasible, they can then turn this local “data pond” into a data lake, storing their data on object storage like Amazon S3, and continue to use DuckDB as a query engine on top of the files stored there.

If you want to better understand why folks are excited about DuckDB, check out this blog post.

Perhaps ducks use DuckDB?

Once again, the answer is also 'nein'. As far as we can tell, usually people use DuckDB 🦆

To determine this, we loaded emoji reaction data for the DuckDB repo from the GitHub API into a DuckDB instance using data load tool (dlt) and explored who has been reacting to issues / PRs in the open source community. This is what we learned…
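As a rough illustration of the approach (not the code from the Colab notebook, which uses dlt's GitHub verified source and also looks at the individual reactors), a resource like the one below could pull issues and PRs with their aggregate reaction counts and load them into DuckDB; the GitHub token is read from dlt's secrets.

```python
import dlt
import requests


@dlt.resource(name="issue_reactions", write_disposition="replace")
def issue_reactions(repo="duckdb/duckdb", access_token=dlt.secrets.value):
    """Yield issues/PRs of a repo together with their aggregate reaction counts."""
    url = f"https://api.github.com/repos/{repo}/issues"
    params = {"state": "all", "per_page": 100}
    headers = {"Authorization": f"Bearer {access_token}"}
    while url:
        response = requests.get(url, headers=headers, params=params)
        response.raise_for_status()
        for issue in response.json():
            yield {
                "number": issue["number"],
                "title": issue["title"],
                "reactions_total": issue["reactions"]["total_count"],
            }
        url = response.links.get("next", {}).get("url")  # follow pagination
        params = None  # the "next" link already carries the query string


pipeline = dlt.pipeline(
    pipeline_name="github_reactions", destination="duckdb", dataset_name="duckdb_repo"
)
print(pipeline.run(issue_reactions()))
```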

The three issues / PRs with the most reactions all-time are

  1. SQLAlchemy dialect #305
  2. Add basic support for GeoSpatial type #2836
  3. Support AWS default credential provider chain #4021

The three issues / PRs with the most reactions in 2023 are

  1. Add support for Pivot/Unpivot statements #6387
  2. Add support for a pluggable storage and catalog back-end, and add support for a SQLite back-end storage #6066
  3. Add support for UPSERT (INSERT .. ON CONFLICT DO ..) syntax #5866

Some of the most engaged users (other than the folks who work at DuckDB Labs) include

All of these users seem to be people. Admittedly, we didn’t look at everyone though, so there could be ducks within the flock. You can check yourself by playing with the Colab notebook.

Maybe it’s called DuckDB because you can use it to create a "data pond" that can grow into a data lake + ducks like water?

Although this is a cool idea, it is still not the reason that it is called DuckDB 🌊

Using functionality offered by DuckDB to export the data loaded to it as Parquet files, you can create a small “data pond” on your local computer. To make it a data lake, you can then add these files to Google Cloud Storage, Amazon S3, etc. And if you want this data lake to always fill with the latest data from the GitHub API, you can deploy the dlt pipeline.
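For example, exporting a table from a local DuckDB file to Parquet is a single COPY statement; a minimal sketch with made-up file and table names:

```python
import duckdb

con = duckdb.connect("github_reactions.duckdb")

# Write a loaded table out as a Parquet file: the start of the local "data pond".
con.sql("COPY duckdb_repo.issue_reactions TO 'issue_reactions.parquet' (FORMAT PARQUET)")

# Once uploaded to S3 or GCS, the same file can be queried in place with DuckDB's httpfs extension.
```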

Check this out in the Colab notebook and let us know if you want some help setting this up.

Just tell me why it is called DuckDB!!!

Okay. It’s called DuckDB because ducks are amazing and @hannes once had a pet duck 🤣

Why "Duck" DB? Source: DuckDB: an Embeddable Analytical RDBMS

Enjoy this blog post? Give data load tool (dlt) a ⭐ on GitHub here 🤜🤛

· 9 min read
Matthaus Krzykowski

Summary

The excitement around DuckDB has steadily increased over the last year. The project has consistently crossed the 1M downloads per month mark over the last three months (Dec ‘22, Jan ‘23, Feb ‘23), a large increase from the ~250k downloads per month in February ‘22.

Like so many others, we are excited about the project, too. Recently, we attended the DuckDB conference and spoke with many members of the community to learn why people are excited about it. We examined issues on GitHub, interviewed some of the top contributors to the project, and even experimented with DuckDB ourselves to determine how we could contribute.

With our research, we aimed to identify the most popular reasons why people try out DuckDB. We found five perspectives that people commonly have when trying out DuckDB.

Marcin watching a MotherDuck presentation

dltHub co-founder Marcin watching a MotherDuck presentation at DuckCon in Brussels in February

1) "Normie" users love to drop-In DuckDB in all sorts of places

Last December, folks from the data + ML community organized a great conference, Normconf. Presenters and attendees were invited to share stories about everyday data problems and celebrate solutions. Many of the attendees referred to themselves as "normies" during the conference. We have found many of these folks overlap with those trying out DuckDB.

Normies have a specific way of solving problems that breaks some behavior patterns and expectations of previous generations of software engineers. As Ben Labaschin explains in his presentation Building an HTTPS Model API for Cheap: AWS, Docker, and the Normconf API, “normie software” has the following criteria:

  • It does not require domain knowledge to use. You should need almost no documentation, because time is most important and all learning is an investment. Before you learn, you need to know if it will pay off in the future.
  • The best tools can be reused in many contexts. You learn them once, and you can apply them everywhere.
  • Tools should work together. You should pick the tool that works with your other tools.

Many at Normconf agreed that DuckDB is also a “normie tool”. It has a Python wrapper and can be dropped into any Python script, notebook, or Streamlit app. It has helpers and integrates with other libraries that are part of typical workflows (e.g. Pandas, Parquet, Arrow, etc). It is a powerful analytical database and brings local SQL execution without credentials and other hassles. You can scan data from and export data to Parquet, CSV, or JSON and query an S3 bucket directly.
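A small illustration of that drop-in experience: the same duckdb.sql() call can query an in-memory pandas DataFrame, a local Parquet file, or, with the httpfs extension, files in an S3 bucket (the file and bucket names below are made up).

```python
import duckdb
import pandas as pd

df = pd.DataFrame({"category": ["a", "b", "a"], "value": [1, 2, 3]})

# Query a DataFrame in place: no server, no credentials.
duckdb.sql("SELECT category, sum(value) AS total FROM df GROUP BY category").show()

# Scan a Parquet file directly with the same one-liner.
df.to_parquet("events.parquet")
duckdb.sql("SELECT count(*) FROM 'events.parquet'").show()

# With the httpfs extension configured, the same pattern works against object storage:
# duckdb.sql("SELECT count(*) FROM 's3://my-bucket/events/*.parquet'").show()
```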


2) Local data workflows are going mainstream, and DuckDB is at the center

Many people struggle to access cloud data warehouses within their organizations. Some of the problems that these users encounter include:

  1. Painful processes to obtain credentials and permissions
  2. A setup that is perceived as "difficult"
  3. Deployment of local composer files
  4. Working on remote machines is often much less pleasant

Instead they often use DuckDB to load data locally. DuckDB enables people to start using data by:

  1. Allowing them to learn SQL and try examples without any setup
  2. Querying GCP or S3 buckets from a local machine
  3. Creating notebooks or data apps with embedded DuckDB that showcase their work

Prototyping and experimenting with production data locally on DuckDB is a popular practice. From what we learned, deployment of DuckDB to production is still quite rare. Companies seldom rely on local workflows, since these depend on someone having their laptop turned on in order to function. However, many non-engineering personnel use DuckDB to access production data.

3) The community is exploring various ways to use DuckDB's columnar query engine (e.g. analytics)

As many people in the community are exploring how DuckDB could be used, the fact that the DuckDB engine provides a way to quickly query the columnar format seems to be central to multiple use cases.

In data processing and usage, there are two types of transformations:

  • Non-time-critical, usually nightly, "transformation" jobs. These are run programmatically, the output is saved somewhere, and a business person consumes this output on demand a few hours or days later
  • Time-critical "dashboard user access" jobs. A report is created as output from the former job. Now, a user wants to gain insights from it "on demand", so they aggregate this table and wait. This computation is now time-critical, because the user is actively waiting for it

Row-based engines like Postgres are great at the first type of job, which usually involves many joins and row-based operations. However, they are not fast at aggregating data sets, as this requires them to iterate over rows by index (access the row, find its value, sum it).

Column-based engines, such as DuckDB, the Postgres AlloyDB engine, MySQL Percona, Redshift, etc., excel at aggregation jobs, which, for example, count or sum elements in a column.

Here, we have found evidence of two use cases where DuckDB is particularly suited:

  1. Aggregation of event data (e.g. product analytics). A dump of events could be easily scanned and aggregated into "users", "sessions", counted, etc. By using a database architecture like an "activity schema" and a way to tag events, this would make a great "product analytics" embedded warehouse. The MotherDuck team explains some reasons why to use DuckDB for analytics here.
  2. Aggregation of "one big table" architectures that use "one big table" instead of a dimensional model. This is a common design pattern for organizations that do not tackle data architecture, such as small analyst teams.

The amount of data processed for analytics workloads is often smaller than people think, with dashboards typically built from aggregated data. The co-founder of MotherDuck suggests that a past analysis he did of BigQuery queries found that 90% of queries in organisations processed less than 100 MB of data (while the co-founder of Ponder kind of disagrees). Many people load CSV, Parquet, and JSON files ranging from 50 to 200M rows into DuckDB. This includes not only one-off data loading but also ongoing demo projects. Common advantages of DuckDB we frequently heard about are speed, cost, and the usability advantages mentioned above.

4) With DuckDB, users bring the database engine to their data instead of the other way around

For most of us this behavioural pattern should ring true:

  1. “I have my data where I see fit. For example, in an AWS S3 bucket or on my laptop. My workflows deal with this well. And the workflow is fairly simple and cheap.”
  2. “Now I want to process my data.”
  3. “So I move my data to where a database engine is. I load it somewhere. I load it to Snowflake, BigQuery, Redshift. This is more complicated and costs significantly more.”

We encountered a different related pattern with DuckDB users. DuckDB users often bring the engine to their data:

  1. People drop it into AWS Lambda or Google Cloud Functions to bring the engine close to the data. The few DuckDB production deployments that we have seen were all AWS Lambda centric. You can read a general example here.
  2. The product that makes this pattern super easy on AWS is Boiling Data. The product also scales fairly well.

The community frequently experiments with making such a setup work. In a recent post from frequent DuckDB issue contributor Mimoune Djouallah, he describes how he used Azure Storage, DuckDB, and an Azure ML Notebook to build a cost-effective data engineering pipeline for smaller data workloads. The pipeline involves reading data from Azure Storage, running complex queries, and saving the results in a bucket. The resulting bucket can be consumed in Synapse Serverless/PowerBI/Notebook, etc. We wonder, just like Mimoune and others do, if vendors will be building more support for smaller data workloads going forward.

It is probably worth mentioning at this stage that DuckDB and even PostgreSQL are not databases in the traditional sense. Rather, they are relational "database" management systems (RDBMS) that manage structured collections of data. While both can be used for querying internally stored data and reading external data like files or federated databases, DuckDB focuses primarily on the latter.

This means that it functions more as an RDBMS without an attached database. Therefore, it is inaccurate to refer to DuckDB as a database. Instead, we should consider the parquet file or data lake as the actual database, with DuckDB serving as a tool for managing and querying the data.

5) Some people are dropping DuckDB into their BI tooling

A few people we met have chosen to have always-up file storage and on-demand DuckDB embedded in a BI tool. A BI tool that supports this use case is Metabase. It’s an interesting concept that might one day lead to a stable open source BI tool SQL cache.

This is different from the past. Classic business intelligence tools use the OLAP concept: the business user creates a pivot table, and the tool composes a SQL query automatically and issues it to the database. This slows things down and adds additional cost. Think about Data Studio doing queries on BigQuery just because you change the granularity.

If part of the data is cached (in memory or as a temporary Parquet file) and the BI tool has an embedded query engine, it can do the following tricks, for free and in no time (a toy sketch follows the list):

  • change the granularity of a time dimension between different units (hour, day, week, calendar week, US calendar week, etc.)
  • drill downs, drill throughs, and filters
  • leveraging joins through predefined paths or UI query builders
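A toy sketch of the first trick: once the report extract is cached as a local Parquet file (faked below), regrouping by a different time unit becomes a free local query instead of another round trip to the warehouse.

```python
import duckdb

# Fake a cached report extract; in a real setup the BI tool would have written this file.
duckdb.sql("""
    COPY (
        SELECT TIMESTAMP '2023-01-01' + INTERVAL 1 MINUTE * i AS event_time
        FROM range(10000) t(i)
    ) TO 'report_cache.parquet' (FORMAT PARQUET)
""")

# Changing the granularity is now just a cheap query over the local cache.
for unit in ("hour", "day", "week"):
    duckdb.sql(f"""
        SELECT date_trunc('{unit}', event_time) AS bucket, count(*) AS events
        FROM 'report_cache.parquet'
        GROUP BY bucket ORDER BY bucket
    """).show()
```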

