As DuckDB crosses 1M downloads / month, what do its users do?
- Matthaus Krzykowski,
Co-Founder & CEO
Summary
The excitement around DuckDB has steadily increased over the last year. The project has consistently crossed the 1M downloads per month mark over the last three months (Dec ‘22, Jan ‘23, Feb ‘23), a large increase from the ~250k downloads per month in February ‘22.
Like so many others, we are excited about the project, too. Recently, we attended the DuckDB conference and spoke with many members of the community to learn why people are excited about it. We examined issues on GitHub, interviewed some of the top contributors to the project, and even experimented with DuckDB ourselves to determine how we could contribute.
We aimed to identify the most popular reasons why people try out DuckDB with our research. We found five perspectives that people commonly have when trying out DuckDB.
"Normie" users love to drop-In DuckDB in all sorts of places
Last December, folks from the data + ML community organized a great conference, Normconf. Presenters and attendees were invited to share stories about everyday data problems and celebrate solutions. Many of the attendees referred to themselves as "normies" during the conference. We have found many of these folks overlap with those trying out DuckDB.
Normies have a specific way of solving problems that breaks some behavior patterns and expectations of previous generations of software engineers. As Ben Labaschin explains in his presentation Building an HTTPS Model API for Cheap: AWS, Docker, and the Normconf API, “normie software” has the following criteria:
- It does not require domain knowledge to use. You should need almost no documentation because time is most important and all learning is investment. Before you learn, you need to know if it will pay off in the future.
- The best tools can be reused in many contexts. You learn them once, and you can apply them everywhere.
- Tools should work together. You should pick the tool that works with your other tools.
Many at Normconf agreed that DuckDB is also a “normie tool”. It has a Python wrapper and can be dropped into any Python script, notebook, or Streamlit app. It has helpers and integrates with other libraries that are part of typical workflows (e.g. Pandas, Parquet, Arrow, etc). It is a powerful analytical database and brings local SQL execution without credentials and other hassles. You can scan data from and export data to Parquet, CSV, or JSON and query an S3 bucket directly.
Local Data Workflows Are Going Mainstream, and DuckDB Is at the Center
Many people struggle to access cloud data warehouses within their organizations. Some of the problems that these users encounter include:
- Painful processes to obtain credentials and permissions
- A setup that is perceived as "difficult"
- Deployment of local composer files
- Working on remote machines is often much less pleasant
Instead they often use DuckDB to load data locally. DuckDB enables people to start using data by:
- Allowing them to learn SQL and try examples without any setup
- Querying GCP or S3 buckets from a local machine
- Creating notebooks or data apps with embedded DuckDB that showcase their work
Prototyping and experimenting with production data locally on DuckDB is a popular practice. From what we learned, deployment of DuckDB to production is still quite rare. Companies seldom use local workflows as this depends on someone having their laptop turned on to function. However, many non-engineering personnel use DuckDB to access production data.
The community is exploring various ways to use DuckDB's columnar query engine (e.g. analytics)
As many people in the community are exploring how DuckDB could be used, the fact that the DuckDB engine provides a way to quickly query the columnar format seems to be central to multiple use cases.
In data processing and usage, there are two types of transformations:
- Non-time-critical, usually nightly, "transformation" jobs. These are run programmatically, the output is saved somewhere, and a business person consumes this output on demand a few hours or days later
- Time-critical "dashboard user access" jobs. A report is created as output from the former job. Now, a user wants to gain insights from it "on demand", so they aggregate this table and wait. This computation is now time-critical, because the user is actively waiting for it
Row-based engines like Postgres are great at the first type of job, which usually involves many joins and row-based operations. However, they are not fast at aggregating data sets, as this requires them to iterate over rows by index (access the row, find its value, sum it).
Column-based engines, such as DuckDB, the Postgres AlloyDB engine, MySQL Percona, Redshift, etc., excel at aggregation jobs, which, for example, count or sum elements in a column.
Here, we have found evidence of two use cases where DuckDB is particularly suited:
- Aggregation of event data (e.g. product analytics). A dump of events could be easily scanned and aggregated into "users", "sessions", counted, etc. By using a database architecture like an "activity schema" and a way to tag events, this would make a great "product analytics" embedded warehouse. The MotherDuck team explains some reasons why to use DuckDB for analytics here.
- Aggregation of "one big table" architectures that use "one big table" instead of a dimensional model. This is a common design pattern for organizations that do not tackle data architecture, such as small analyst teams.
The amount of data processed for analytics workloads is often smaller than people think, with dashboards typically built from aggregated data. The co-founder of Motherduck suggests that a past analysis of his of BigQuery queries found that 90% of queries in organisations processed less than 100 MB of data (while the co-founder of Ponder kind of disagrees). Many people load CSV, Parquet, and JSON files sizes ranging from 50 to 200M rows into DuckDB. This includes not only one-off data loading but also ongoing demo projects. Common advantages of DuckDB we frequently heard about are speed, costs, and the usability advantages mentioned above.
With DuckDB users bring the database engine to their data and instead of the other way around
For most of us this behavioural pattern should ring true:
- “I have my data where I see fit. For example, in a AWS S3 bucket or on my laptop. My workflows deal with this well. And the workflow is fairly simple and cheap.”
- “Now I want to process my data.”
- “So I move my data to where a database engine is. I load it somewhere. I load it to Snowflake, BigQuery, Redshift. This is more complicated and costs significantly.”
We encountered a different related pattern with DuckDB users. DuckDB users often bring the engine to their data:
- People drop it into AWS Lambda or Google Cloud Function to process data close to the engine. The few DuckDB production deployments that we have seen were all AWS Lambda centric. You can read a general example here
- The product that makes this pattern super easy for AWS is Boiling Data. The product also scales fairly well
The community frequently experiments with making such a setup work. In a recent post from frequent DuckDB issue contributor Mimoune Djouallah, he describes how he used Azure Storage, DuckDB, and an Azure ML Notebook to build a cost-effective data engineering pipeline for smaller data workloads. The pipeline involves reading data from Azure Storage, running complex queries, and saving the results in a bucket. The resulting bucket can be consumed in Synapse Serverless/PowerBI/Notebook, etc. We wonder, just like Mimoune and others do, if vendors will be building more support for smaller data workloads going forward.
It is probably worth mentioning at this stage that DuckDB and even PostgreSQL are not databases in the traditional sense. Rather, they are relational "database" management systems (RDBMS) that manage structured collections of data. While both can be used for querying internally stored data and reading external data like files or federated databases, DuckDB focuses primarily on the latter.
This means that it functions more as an RDBMS without an attached database. Therefore, it is inaccurate to refer to DuckDB as a database. Instead, we should consider the parquet file or data lake as the actual database, with DuckDB serving as a tool for managing and querying the data.
Some people are dropping DuckDB into their BI tooling
A few people people we met have chosen to have always-up file storage and on-demand DuckDB embedded in a BI tool. A BI tool that supports this use case is Metabase. It’s an interesting concept that might one day lead to a stable open source BI tool SQL cache.
This is different than the past. Classic business intelligence tools use the OLAP concept—the business user creates a pivot table, and the tool composes a SQL query automatically and issues it to the database. This slows things down and adds additional cost. Think about Data Studio doing queries on BigQuery just because you change the granularity.
If the part of the data is cached (in-memory or as temporary parquet file) and the BI tool has an embedded query engine, it can do following tricks, for free and in no-time:
- change the granularity of a time dimension between different units (hour, day, week, calendar week, US calendar week, etc.)
- drill downs, drill throughs, and filters
- leveraging joins through predefined paths or UI query builders