
Is DuckDB a database for ducks?

· 3 min read
Matthaus Krzykowski

Using DuckDB, dlt, & GitHub to explore DuckDB

tip

TL;DR: We created a Colab notebook for you to learn more about DuckDB (or any open source repository of interest) using DuckDB, dlt, and the GitHub API 🙂

So is DuckDB full of data about ducks?

Nope, you can put whatever data you want into DuckDB ✨

Many data analysts, data scientists, and developers prefer to work with data on their laptops. DuckDB allows them to start quickly and easily. When working only locally becomes infeasible, they can then turn this local “data pond” into a data lake, storing their data on object storage like Amazon S3, and continue to use DuckDB as a query engine on top of the files stored there.
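To give a sense of how low that barrier is, here is a minimal sketch of the laptop-first workflow. The database file name, the CSV file, and its columns are all made up for illustration:

```python
import duckdb

# One local file is the whole database: no server, no setup.
con = duckdb.connect("local_pond.duckdb")

# DuckDB can query files in place; events.csv and its columns are hypothetical.
top_reactors = con.execute("""
    SELECT user_login, count(*) AS n_reactions
    FROM read_csv_auto('events.csv')
    GROUP BY user_login
    ORDER BY n_reactions DESC
    LIMIT 10
""").df()
print(top_reactors)
```

The whole database lives in a single local file, which is part of what makes the later move to Parquet files on object storage feel so natural.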

If you want to better understand why folks are excited about DuckDB, check out this blog post.

Perhaps ducks use DuckDB?

Again, the answer is 'nein' (no). As far as we can tell, it is usually people who use DuckDB 🦆

To determine this, we used data load tool (dlt) to load emoji reaction data for the DuckDB repo from the GitHub API into a DuckDB instance and explored who has been reacting to issues / PRs in the open source community (a rough sketch of such a pipeline follows the findings below). This is what we learned…

The three issues / PRs with the most reactions of all time are:

  1. SQLAlchemy dialect #305
  2. Add basic support for GeoSpatial type #2836
  3. Support AWS default credential provider chain #4021

The three issues / PRs with the most reactions in 2023 are:

  1. Add support for Pivot/Unpivot statements #6387
  2. Add support for a pluggable storage and catalog back-end, and add support for a SQLite back-end storage #6066
  3. Add support for UPSERT (INSERT .. ON CONFLICT DO ..) syntax #5866

We also looked at some of the most engaged users (other than the folks who work at DuckDB Labs).

All of these users seem to be people. Admittedly, we didn’t look at everyone, so there could be ducks within the flock. You can check for yourself by playing with the Colab notebook.
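For reference, here is a rough sketch of how such a dlt pipeline can be wired up. It is a hedged, minimal example rather than the notebook's code: the resource and pipeline names are illustrative, and the real implementation lives in the Colab notebook.

```python
import dlt
import requests

@dlt.resource(name="issues", write_disposition="replace")
def duckdb_issues():
    # The GitHub REST API lists issues and PRs together; each item carries a
    # "reactions" summary (total_count, +1, -1, heart, ...).
    # Unauthenticated requests are rate-limited; pass an Authorization header
    # with a GitHub token for anything beyond a quick demo.
    url = "https://api.github.com/repos/duckdb/duckdb/issues"
    params = {"state": "all", "per_page": 100}
    while url:
        resp = requests.get(url, params=params)
        resp.raise_for_status()
        yield resp.json()
        url = resp.links.get("next", {}).get("url")  # follow pagination
        params = None  # the "next" link already embeds the query parameters

pipeline = dlt.pipeline(
    pipeline_name="github_reactions",
    destination="duckdb",      # writes to a local github_reactions.duckdb file
    dataset_name="duckdb_repo",
)
load_info = pipeline.run(duckdb_issues())
print(load_info)
```

dlt normalizes the nested JSON it receives (including each item's reactions object) into tables and columns inside the local DuckDB file, so the reaction counts end up directly queryable with SQL.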

Maybe it’s called DuckDB because you can use it to create a "data pond" that can grow into a data lake, and because ducks like water?

Although this is a cool idea, it’s still not why it’s called DuckDB 🌊

Using DuckDB’s ability to export the data loaded into it as Parquet files, you can create a small “data pond” on your local computer. To turn it into a data lake, you can then add these files to Google Cloud Storage, Amazon S3, etc. And if you want this data lake to keep filling with the latest data from the GitHub API, you can deploy the dlt pipeline.
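A minimal sketch of that step could look like the following, assuming the pipeline sketched above produced a github_reactions.duckdb file. The S3 bucket is hypothetical, and the flattened column name reactions__total_count reflects how dlt typically names nested fields, so check the schema from your own run:

```python
import duckdb

# Connect to the local file the dlt pipeline above produced (name assumed).
con = duckdb.connect("github_reactions.duckdb")

# 1. Local "data pond": export a loaded table to a Parquet file on your laptop.
con.execute("COPY duckdb_repo.issues TO 'issues.parquet' (FORMAT PARQUET)")

# 2. Data lake: after uploading issues.parquet to object storage (e.g. with the
#    AWS CLI), DuckDB can query it in place through the httpfs extension.
con.execute("INSTALL httpfs")
con.execute("LOAD httpfs")
con.execute("SET s3_region='eu-central-1'")  # plus your S3 credentials

top = con.execute("""
    SELECT title, reactions__total_count
    FROM read_parquet('s3://my-hypothetical-data-lake/issues.parquet')
    ORDER BY reactions__total_count DESC
    LIMIT 3
""").df()
print(top)
```

From here, the same SQL works whether the Parquet files sit on your laptop or in the bucket, which is the point of keeping DuckDB as the query engine on top.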

Check this out in the Colab notebook and let us know if you want some help setting this up.

Just tell me why it is called DuckDB!!!

Okay. It’s called DuckDB because ducks are amazing and @hannes once had a pet duck 🤣

Why "Duck" DB? Source: DuckDB: an Embeddable Analytical RDBMS

Enjoy this blog post? Give data load tool (dlt) a ⭐ on GitHub here 🤜🤛

