
3 posts tagged with "BigQuery"


· 6 min read
Anuun Chinbat

THE PROBLEM

There are two types of people: those who hoard thousands of unread emails in their inbox and those who open them immediately to avoid the ominous red notification. But one thing unites us all: everyone hates emails. The reasons are clear:

  • They're often unnecessarily wordy, making them time-consuming.
  • SPAM (obviously).
  • They become black holes of lost communication because CC/BCC-ing people doesn't always work.
  • Sometimes, there are just too many.

So, this post will explore a possible remedy to the whole email issue involving AI.


THE SOLUTION

Don't worry; it's nothing overly complex, but it does involve some cool tools that everyone could benefit from.

💡 In a nutshell, I created two flows (a main flow and a subflow) in Kestra:

  • The main flow extracts email data from Gmail and loads it into BigQuery using dlt, checks for new emails, and, if found, triggers the subflow for further processing.
  • The subflow utilizes OpenAI to summarize and analyze the sentiment of an email, loads the results into BigQuery, and then notifies about the details via Slack.

Just so you're aware:

  • Kestra is an open-source automation tool that makes both scheduled and event-driven workflows easy.
  • dlt is an open-source library that you can add to your Python scripts to load data from various and often messy data sources into well-structured, live datasets.
tip

Wanna jump to the GitHub repo?


HOW IT WORKS

To lay it all out clearly: Everything's automated in Kestra, with hassle-free data loading thanks to dlt, and the analytical thinking handled by OpenAI. Here's a diagram to help you understand the general outline of the entire process.

overview

Now, let's delve into specific parts of the implementation.

The environment:

💡 The two flows in Kestra are set up in a very straightforward and intuitive manner. Simply follow the Prerequisites and Setup guidelines in the repo. It should take no more than 15 minutes.

Once you’ve opened http://localhost:8080/ in your browser, this is what you’ll see on your screen:

Kestra

Now, all you need to do is create your flows and execute them.

The great thing about Kestra is its ease of use - it's UI-based, declarative, and language-agnostic. Unless you're using a task like a Python script, you don't even need to know how to code.

tip

If you're already considering ways to use Kestra for your projects, consult their documentation and the plugin pages for further insights.

The data loading part

💡 This is entirely managed by dlt in just five lines of code.

I set up a pipeline using the Inbox source – a regularly tested and verified source from dlt – with BigQuery as the destination.

In my scenario, the email data doesn't have nested structures, so there's no need for flattening. However, if you encounter nested structures in a different use case, dlt can automatically normalize them during loading.

Here's how the pipeline is defined and subsequently run in the first task of the main flow in Kestra:

# Run dlt pipeline to load email data from Gmail to BigQuery
import dlt
import pendulum

# inbox_source comes from dlt's verified "inbox" source (scaffolded via `dlt init inbox bigquery`)
from inbox import inbox_source

pipeline = dlt.pipeline(
    pipeline_name="standard_inbox",
    destination="bigquery",
    dataset_name="messages_data",
    full_refresh=False,
)

# Set table name
table_name = "my_inbox"
# Get messages resource from the source
messages = inbox_source(start_date=pendulum.datetime(2023, 11, 15)).messages
# Configure the messages resource to get bodies of the emails
messages = messages(include_body=True).with_name(table_name)
# Load data to the "my_inbox" table
load_info = pipeline.run(messages)

In this setup ☝️, dlt loads all email data into the table “my_inbox”, with the email body specifically stored in the “body” column. After executing your flow in Kestra, the table in BigQuery should appear as shown below:

bigquery_my_inbox
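The "check for new emails" step of the main flow happens inside Kestra, but the logic is easy to picture in plain Python. Here's a rough, illustrative sketch using the google-cloud-bigquery client; the project id, column names, and the last_check timestamp are hypothetical, not taken from the flow itself:

# Illustrative sketch only: the flow does this check in Kestra, and the
# project id, column names, and `last_check` value below are placeholders.
import pendulum
from google.cloud import bigquery

client = bigquery.Client(project="my-gcp-project")  # hypothetical project id
last_check = pendulum.now("UTC").subtract(hours=1)  # e.g. the previous run

query = """
    SELECT message_uid, subject, body
    FROM `my-gcp-project.messages_data.my_inbox`
    WHERE date > @last_check
"""
job = client.query(
    query,
    job_config=bigquery.QueryJobConfig(
        query_parameters=[
            bigquery.ScalarQueryParameter("last_check", "TIMESTAMP", last_check)
        ]
    ),
)
new_emails = [dict(row) for row in job.result()]
if new_emails:
    print(f"{len(new_emails)} new email(s) found - trigger the subflow")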

tip

This implementation doesn't handle email attachments, but if you need to analyze, for instance, invoice PDFs from your inbox, you can read about how to automate this with dlt here.

The AI part

💡 In this day and age, how can we not incorporate AI into everything? 😆

But seriously, if you're familiar with OpenAI, it's a matter of an API call to the chat completion endpoint. What simplifies it even further is Kestra’s OpenAI plugin.

In my subflow, I used it to obtain both the summary and sentiment analysis of each email body. Here's a glimpse of how it's implemented:

- id: get_summary
  type: io.kestra.plugin.openai.ChatCompletion
  apiKey: "{{ secret('OPENAI_API') }}"
  model: gpt-3.5-turbo
  prompt: "Summarize the email content in one sentence with less than 30 words: {{inputs.data[0]['body']}}"
  messages: [{"role": "system", "content": "You are a tool that summarizes emails."}]
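For comparison, the same kind of request outside Kestra is only a few lines with the openai Python SDK. This is just a sketch of a roughly equivalent call; the email_body variable is an assumption standing in for one email pulled from BigQuery:

# Rough sketch of an equivalent chat completion call with the openai SDK
# (the `email_body` variable is assumed; in the flow above, Kestra's plugin
# makes this request for you).
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

email_body = "..."  # body of one email, e.g. fetched from the my_inbox table

summary = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "You are a tool that summarizes emails."},
        {
            "role": "user",
            "content": "Summarize the email content in one sentence "
            f"with less than 30 words: {email_body}",
        },
    ],
)
print(summary.choices[0].message.content)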
info

Kestra also includes Slack and BigQuery plugins, which I used in my flows. Additionally, there is a wide variety of other plugins available.

The automation part

💡 Kestra triggers are the ideal solution!

I’ve used a schedule trigger that allows you to execute your flow on a regular cadence, e.g. using a cron expression:

triggers:
  - id: schedule
    type: io.kestra.core.models.triggers.types.Schedule
    cron: "0 9-18 * * 1-5"

This configuration ensures that your flows are executed hourly on workdays from 9 AM to 6 PM.
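If you want to sanity-check a cron expression before wiring it into a trigger, the third-party croniter package (my own addition, not something used in this project) can preview the upcoming run times:

# Preview the next few run times with the third-party `croniter` package
# (not part of the project - just a convenient way to verify the schedule).
from datetime import datetime
from croniter import croniter

itr = croniter("0 9-18 * * 1-5", datetime(2023, 11, 20, 8, 0))  # a Monday morning
for _ in range(3):
    print(itr.get_next(datetime))
# -> 2023-11-20 09:00, 2023-11-20 10:00, 2023-11-20 11:00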


THE OUTCOME

A Slack assistant that delivers crisp inbox insights right at your fingertips:

slack.png

And a well-organized table in BigQuery, ready for you to dive into a more complex analysis:

bigquery_test.png

In essence, using Kestra and dlt offers a trio of advantages for refining email analysis and data workflows:

  1. Efficient automation: Kestra effortlessly orchestrates intricate workflows, integrating smoothly with tools like dlt, OpenAI, and BigQuery. This reduces manual intervention, eliminates errors, and frees up more time for you.
  2. User-friendly and versatile: Both Kestra and dlt are crafted for ease of use, accommodating a range of skill levels. Their adaptability extends to various use cases.
  3. Seamless scaling: Kestra, powered by Kafka and Elasticsearch, adeptly manages large-scale data and complex workflows. Coupled with dlt's solid data integration capabilities, it ensures a stable and reliable solution for diverse requirements.

HOW IT COULD WORK ELSEWHERE

Basically, you can apply the architecture discussed in this post whenever you need to automate a business process!

For detailed examples of how Kestra can be utilized in various business environments, you can explore Kestra's use cases.

Embrace automation, where the only limit is your imagination! 😛

· 10 min read
Rahul Joshi
info

TL;DR: I combined dlt, dbt, DuckDB, MotherDuck, and Metabase to create a Modern Data Stack in a box that makes it very easy to create a data pipeline from scratch and then deploy it to production.

I started working at dltHub in March 2023, right around the time when we released DuckDB as a destination for dlt. As a Python user, being able to create a data pipeline, load the data on my laptop, and explore and query the data all in Python was awesome.

At the time I also came across this very cool article by Jacob Matson in which he talks about creating a Modern Data Stack (MDS) in a box with DuckDB. I was already fascinated with dlt and all the other new tools that I was discovering, so reading about this approach of combining different tools to execute an end-to-end proof of concept on your laptop was especially interesting.

Fast forward to a few weeks ago when dlt released MotherDuck as a destination. The first thing that I thought of was an approach to MDS in a box where you develop locally with DuckDB and deploy in the cloud with MotherDuck. I wanted to try it out.

What makes this awesome

In my example, I wanted to customize reports on top of Google Analytics 4 (GA4) and combine it with data from GitHub. This is usually challenging because, while exporting data from GA4 to BigQuery is simple, combining it with other sources and creating custom analytics on top of it can get pretty complicated.

By first pulling all the data from different sources into DuckDB files on my laptop, I was able to do my development and customization locally.

local-workflow

And then when I was ready to move to production, I was able to seamlessly switch from DuckDB to MotherDuck with almost no code re-writing!

production-workflow

Thus I got a super simple and highly customizable MDS in a box that is also close to a company production setting.

What does this MDS in a box version look like?

| Tool | Layer | Why it’s awesome |
| --- | --- | --- |
| dlt | data ingestion | ridiculously easy to write a customized pipeline in Python to load from any source |
| DuckDB | data warehouse on your laptop | free, fast OLAP database on your local laptop that you can explore using SQL or Python |
| MotherDuck | data warehouse in the cloud | DuckDB, but in the cloud: a fast OLAP database that you can connect to your local DuckDB file and share with the team in company production settings |
| dbt | data transformation | an amazing open source tool to package your data transformations; it also combines well with dlt, DuckDB, and MotherDuck |
| Metabase | reporting | open source, has support for DuckDB, and looks prettier than my Python notebook |

How this all works

The example that I chose was inspired by one of my existing workflows: that of understanding dlt-related metrics every month. Previously, I was using only Google BigQuery and Metabase to understand dlt’s product usage, but now I decided to test what a migration to DuckDB and MotherDuck would look like.

The idea is to build a dashboard to track metrics around how people are using and engaging with dlt on different platforms, like GitHub (contributions, activity, stars, etc.) and the dlt website and docs (number of website/docs visits, etc.).

This is a perfect problem to test out my new super simple and highly customizable MDS in a box because it involves combining data from different sources (GitHub API, Google Analytics 4) and tracking them in a live analytics dashboard.

  1. Loading the data using dlt

    The advantage of using dlt for data ingestion is that dlt makes it very easy to create and customize data pipelines using just Python.

    In this example, I created two data pipelines:

    • BigQuery → DuckDB: Since all the Google Analytics 4 data is stored in BigQuery, I needed a pipeline that could load all events data from BigQuery into a local DuckDB instance. BigQuery does not exist as a verified source for dlt, which means that I had to write this pipeline from scratch.
    • GitHub API → DuckDB:
      dlt has an existing GitHub source that loads data around reactions, PRs, comments, and issues. To also load data on stargazers, I had to modify the existing source.

    dlt is simple and highly customizable:

    • Even though BigQuery does not exist as a dlt source, dlt makes it simple to write a pipeline that uses BigQuery as a source. Here's how this looks:

      1. Create a dlt project:

        dlt init bigquery duckdb

        This creates a folder with the directory structure

        ├── .dlt
        │   ├── config.toml
        │   └── secrets.toml
        ├── bigquery.py
        └── requirements.txt
      2. Add BigQuery credentials inside .dlt/secrets.toml.

      3. Add a Python function inside bigquery.py that requests the data (a rough sketch of such a function follows after these steps).

      4. Load the data by simply running python bigquery.py.

        See the accompanying repo for a detailed step-by-step on how this was done.
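      To make step 3 concrete, here's a rough sketch of what a BigQuery-reading resource in bigquery.py could look like. The table name and query are placeholders, and the actual pipeline in the repo differs in the details:

        import dlt
        from google.cloud import bigquery


        @dlt.resource(name="ga_events", write_disposition="append")
        def ga_events(events_table: str = "my-project.analytics_123.events_*"):
            # Hypothetical GA4 export table - replace with your own.
            # In this sketch, BigQuery credentials are picked up from the environment.
            client = bigquery.Client()
            for row in client.query(f"SELECT * FROM `{events_table}`").result():
                yield dict(row)


        if __name__ == "__main__":
            pipeline = dlt.pipeline(
                pipeline_name="bigquery",
                destination="duckdb",
                dataset_name="ga4_events",
            )
            print(pipeline.run(ga_events()))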

    • The data in BigQuery is nested, which dlt automatically normalizes on loading.

      BigQuery might store data in nested structures which would need to be flattened before being loaded into the target database. This typically increases the challenge in writing data pipelines.

      dlt simplifies this process by automatically normalizing such nested data on load.

      nested-bigquery

      Example of what the nested data in BigQuery looks like.

      normalized-bigquery

      dlt loads the main data into table ga_events, and creates another table ga_events__event_params for the nested data.

    • The existing GitHub source does not load information on stargazers. dlt makes it easy to customize the GitHub source for this.

      The way the existing GitHub verified source is written, it only loads data on GitHub issues, reactions, comments, and pull requests. To configure it to also load data on stargazers, all I had to do was add a Python function for it in the pipeline script (sketched below).

      See the accompanying repo for a detailed step-by-step on how this was done.
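      For illustration, such a stargazers resource can be as small as the sketch below. The function actually added to the verified source (see the repo) is more complete; the pagination and auth details here are simplified:

        import dlt
        import requests


        @dlt.resource(name="stargazers", write_disposition="replace")
        def stargazers(
            owner: str = "dlt-hub",
            repo: str = "dlt",
            access_token: str = dlt.secrets.value,
        ):
            # The `star+json` media type makes GitHub include the `starred_at` timestamp.
            headers = {
                "Authorization": f"Bearer {access_token}",
                "Accept": "application/vnd.github.star+json",
            }
            url = f"https://api.github.com/repos/{owner}/{repo}/stargazers"
            page = 1
            while True:
                response = requests.get(url, headers=headers, params={"per_page": 100, "page": page})
                response.raise_for_status()
                rows = response.json()
                if not rows:
                    break
                yield rows
                page += 1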

  2. Using DuckDB as the data warehouse
    DuckDB is open source, fast, and easy to use. It simplifies the process of validating the data after loading it with the data pipeline.

    In this example, after running the BigQuery pipeline, the data was loaded into a locally created DuckDB file called ‘bigquery.duckdb’, and this allowed me to use Python to explore the loaded data:

    duckdb-explore

    The best thing about using DuckDB is that it provides a local testing and development environment. This means that you can quickly test and validate your workflow, without any additional costs, before deploying it to production.

    Also, being open source, it benefits from community contributions, particularly the dbt-duckdb adapter and the DuckDB Metabase driver, which make it very useful in workflows like these.

  3. dbt for data transformations

    Because of dlt’s dbt runner and DuckDB’s dbt adapter, it was very easy to insert dbt into the existing workflow. What this looked like:

    1. I first installed dbt along with the duckdb adapter using pip install dbt-duckdb.
    2. I then created a dbt project inside the dlt project using dbt init and added any transforms as usual.
    3. Finally, I added dlt’s dbt runner to my Python script, which configured my pipeline to automatically transform the data after loading it. See the documentation for more information on the dbt runner; a rough sketch of this step follows below.
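    Roughly, hooking dlt's dbt runner into the pipeline script looks like the sketch below; the pipeline name and dbt project folder are placeholders, and the repo contains the real values:

      import dlt

      # Load first, then transform: run the dbt project against the same destination.
      pipeline = dlt.pipeline(
          pipeline_name="github",      # hypothetical name, see the repo for the real one
          destination="duckdb",
          dataset_name="github_data",
      )
      # ... pipeline.run(...) happens here ...

      # Point dlt's dbt runner at the dbt project created with `dbt init`
      dbt = dlt.dbt.package(pipeline, "dbt_github")  # hypothetical project folder
      models = dbt.run_all()
      for model in models:
          print(f"{model.model_name} materialized in {model.time} seconds")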
  4. Metabase for the dashboard

    Metabase OSS has a DuckDB driver, which meant that I could simply point it to the DuckDB files in my system and build a dashboard on top of this data.

    dashboard-1

    dashboard-2

    dashboard-3

    dashboard-4

  5. Going to production: Using MotherDuck as the destination

    So far the process had been simple. The integrations among dlt, dbt, DuckDB, and Metabase made the loading, transformation, and visualization of data fairly straightforward. But the data loaded into DuckDB existed only on my machine, and if I wanted to share this data with my team, then I needed to move it to a different storage location accessible by them.

    The best and the easiest way to do this was to use MotherDuck: a serverless cloud analytics platform built on top of DuckDB, where you can host your local DuckDB databases.

    Why choose MotherDuck

    1. Go from development to production with almost no code re-writing:

      This was my main reason for choosing MotherDuck. MotherDuck integrates with dlt, dbt, and Metabase just as well as DuckDB. And I was able to replace DuckDB with MotherDuck in my pipeline with almost no code re-writing!

      What this process looked like:

      1. First, I modified the dlt pipelines to load to MotherDuck instead of DuckDB as follows:
        1. I added credentials for MotherDuck inside .dlt/secrets.toml
        2. I made a minor update to the code: just by changing destination = "duckdb" to destination = "motherduck", the pipelines were configured to load the data into MotherDuck instead of DuckDB (see the sketch after these steps)
      2. With this change, I was already able to deploy my pipelines with GitHub Actions.
      3. After deploying, I simply changed the DuckDB path to the MotherDuck path in Metabase, and then I deployed Metabase on GCP.
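      In code, the switch really is just the destination argument. A minimal sketch with placeholder names:

        import dlt

        # Development: load into a local DuckDB file
        dev_pipeline = dlt.pipeline(
            pipeline_name="bigquery",
            destination="duckdb",
            dataset_name="ga4_events",
        )

        # Production: the only change is the destination (MotherDuck credentials
        # live in .dlt/secrets.toml, as described above)
        prod_pipeline = dlt.pipeline(
            pipeline_name="bigquery",
            destination="motherduck",
            dataset_name="ga4_events",
        )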

      This is great because it greatly simplifies the development lifecycle. Using DuckDB + MotherDuck, you can develop and test your pipeline locally and then move seamlessly to production.

    2. Very easy to copy local DuckDB databases to MotherDuck

      This was especially useful in this demo. Google Analytics 4 events data is typically large and when fetching this data from BigQuery, you are billed for the requests.

      In this example, after I ran the BigQuery -> DuckDB pipeline during development, I wanted to avoid loading the same data again when deploying the pipeline. I was able to do this by copying the complete local DuckDB database to MotherDuck, and configuring the pipeline to only load new data from BigQuery.

    3. Easy to share and collaborate

      Being able to share data with the team was the main goal behind moving from DuckDB to a cloud data warehouse. MotherDuck provides a centralized storage system for the DuckDB databases which you can share with your team, allowing them to access this data from their own local DuckDB databases.

      In my example, after I load the data to MotherDuck, I can provide access to my team just by clicking on ‘Share’ in the MotherDuck web UI.

      motherduck-share

Conclusion:

This was a fun and interesting exercise of creating a simple, yet powerful Modern Data Stack in a box. For me the biggest positives about this approach are:

  1. Everything happened on my laptop during development, giving me full control. Still, going to production was seamless, and I didn't need to change my code and data transformations at all.
  2. I really liked that I could come with my ideas on what data I need and just write the pipelines in Python using dlt. I was not forced to pick from a small pool of existing data extractors. Both customizing code contributed by others and writing my BigQuery source from scratch were fun and didn't go beyond the Python and data engineering knowledge I already had.
  3. I'm impressed by how simple and customizable my version of the MDS is. dlt, DuckDB, and MotherDuck share a similar philosophy of giving full power to the local user and making it easy to interact with them in Python.

I repeat this entire process for the BigQuery pipeline in this video:

· 5 min read
Rahul Joshi
info

TL;DR: Trying to become more user-centric and make data-driven decisions? Get started with the SQL source pipeline + BigQuery + Metabase

When you have a web and/or mobile app but no data yet

If you're a startup without a dedicated data team but with a sizeable number of users on your website or mobile app, then chances are that you are collecting and storing all your product data in OLTP databases like MySQL, Postgres, etc. As you have grown, you have likely been aiming to become more user-centric, yet you find that no one at your company has information on what your users do or what their experience is like. Stakeholders should be making data-driven decisions, but they are not yet because they are unable to use the existing data to understand user behavior and experience. This is usually the point when folks realize they need a data warehouse.

Why a data warehouse is necessary

OLTP databases are great because they are optimized to handle high-volume real-time transactions and maintain data integrity and consistency. However, they are not very well-suited for advanced analytics and data modelling. If you want to create reports, dashboards, and more that help you understand your users, you are going to want to extract, load, and transform (ELT) the data into an OLAP database like Google BigQuery, Snowflake, etc. To do this, you will need to create a data pipeline, which can be quite challenging if your company does not have a dedicated data engineering team.

Why a data pipeline is necessary

Production dashboards rely on the availability of consistent, structured data, which necessitates deploying a data pipeline that is idempotent, can manage the schema and handle schema changes, can be deployed to load data incrementally, etc. For most startups, it's not obvious how to create such pipelines. This is why we decided to demonstrate how one can set up such a data pipeline and build analytics dashboards on top of it.

Why a reporting tool is necessary

We chose to build our dashboard in Metabase because it also offers an open source edition. The advantage of reporting tools like Metabase is that they are easy and intuitive to use even for people who can't write SQL, but at the same time they are powerful enough for those who would like to use SQL.

How we set this up

1. Creating a PostgreSQL -> BigQuery pipeline

Our aim was to create a Metabase dashboard to explore data in a transactional database. The data set that we chose was a sample of The Dell DVD Store 2 database, which we put into a Postgres database deployed on a Google Cloud SQL instance. To make this data available to Metabase, we needed to first load all of the data into a BigQuery instance, and for this we needed a data pipeline. We created this pipeline by doing very simple customizations on the existing dlt sql_database pipeline. See the accompanying repo for the steps we followed.
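A minimal sketch of such a pipeline is shown below, assuming the source was scaffolded with dlt init sql_database bigquery; the connection string and names are placeholders, and the repo has the exact setup:

# Minimal sketch: load the whole Postgres database into BigQuery with the
# verified sql_database source (placeholder credentials and names).
import dlt
from sql_database import sql_database  # scaffolded by `dlt init sql_database bigquery`

# Placeholder connection string - the real one points at the Cloud SQL instance
credentials = "postgresql://loader:password@x.x.x.x:5432/dvdstore"

pipeline = dlt.pipeline(
    pipeline_name="dvdstore",       # hypothetical name
    destination="bigquery",
    dataset_name="dvdstore_data",
)
load_info = pipeline.run(sql_database(credentials))  # loads every table in the schema
print(load_info)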

2. Building a Metabase reporting dashboard

With the database uploaded to BigQuery, we were now ready to build a dashboard. We created a Metabase cloud account and connected it to our BigQuery instance. This made the whole database accessible to Metabase and we were able to analyze the data.

The DVD store database contains data on the products (film DVDs), product categories, existing inventory, customers, orders, order histories etc. For the purpose of the dashboard, we decided to explore the question: How many orders are being placed each month and which films and film categories are the highest selling?

orders_chart.png

top_selling_tables.png

In addition to this, we were also able to set up email alerts to get notified whenever the stock of a DVD was either empty or close to emptying.

low_stock_email_alert.png

3. Deploying the pipeline

With our dashboard ready, all we had to do was deploy our pipeline so that the dashboard could get updated with new data daily. Since the dashboard only uses some of the tables, we needed to modify the pipeline, which was configured to load the entire database, so that it updates only the necessary tables. We also wanted the pipeline to load tables incrementally whenever possible.

We first started by selecting the tables that we wanted to update, namely: orders, orderlines, products, categories, inventory. We then decided whether we wanted to update the tables incrementally or with full replace:

  • Tables orders and orderlines contain data on the orders placed. This means that they also contain a date column and hence are loaded incrementally every day.
  • Tables products, categories, and inventory contain information on the existing products. These tables don't contain a date column and are updated whenever there is any change in inventory. Since the values of the existing data in the tables can change, these tables are not updated incrementally, but are instead fully loaded each time the pipeline is run.
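Translated into the pipeline, those rules look roughly like the sketch below; the orderdate cursor column and the exact hints are assumptions based on the DVD Store schema, and the repo contains the actual configuration:

# Rough sketch: incremental loads for order tables, full replace for the rest
# (cursor column and hints are assumptions; see the repo for the real setup).
import dlt
from sql_database import sql_database  # verified source, as in the first pipeline

source = sql_database().with_resources(
    "orders", "orderlines", "products", "categories", "inventory"
)

# Incremental, append-only loads for the date-stamped order tables
source.orders.apply_hints(
    incremental=dlt.sources.incremental("orderdate"), write_disposition="append"
)
source.orderlines.apply_hints(
    incremental=dlt.sources.incremental("orderdate"), write_disposition="append"
)

# Full replace for the tables whose existing rows can change
for table in ("products", "categories", "inventory"):
    source.resources[table].apply_hints(write_disposition="replace")

pipeline = dlt.pipeline(
    pipeline_name="dvdstore", destination="bigquery", dataset_name="dvdstore_data"
)
print(pipeline.run(source))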

In order to specify these conditions and deploy our pipeline in production, we followed these steps.

