
2 posts tagged with "streamlit"


Rahul Joshi · 4 min read

TL;DR: We created a Hacker News -> BigQuery dlt pipeline to load all comments related to popular ELT keywords and then used GPT-4 to summarize the comments. We now have a live dashboard that tracks these keywords and an accompanying GitHub repo detailing our process.

Motivation

To figure out how to improve dlt, we are constantly learning about how people approach extracting, loading, and transforming data (i.e. ELT). This means we are often reading posts on Hacker News (HN), a forum where many developers like ourselves hang out and share their perspectives. But finding and reading the latest comments about ELT on their website has proved time-consuming and difficult, even with Algolia's Hacker News Search.

So we decided to set up a dlt pipeline to extract and load comments matching ELT keywords (e.g. Airbyte, Fivetran, Matillion, Meltano, Singer, Stitch) from the HN API. This let us set up a custom dashboard and create one-sentence summaries of the comments using GPT-4, which made it much easier and faster to learn about the strengths and weaknesses of these tools. In the rest of this post, we share how we did this for ELT. A GitHub repo accompanies this blog post, so you can clone and deploy it yourself to learn about the perspective of HN users on anything by replacing the keywords.

Creating a dlt pipeline for Hacker News

For the dashboard to have access to the comments, we needed a data pipeline. So we built a dlt pipeline that could load the comments from the Algolia Hacker News Search API into BigQuery. We did this by first writing the logic in Python to request the data from the API and then following this walkthrough to turn it into a dlt pipeline.

With our dlt pipeline ready, we loaded all of the HN comments corresponding to the keywords from January 1st, 2022 onward.
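For reference, here is a minimal sketch of what such a pipeline can look like. The endpoint and response fields come from the Algolia HN Search API; the resource, pipeline, and dataset names are our illustration, BigQuery credentials are expected in `.dlt/secrets.toml`, and the real implementation lives in the accompanying repo.

```python
import dlt
import requests

# The keywords we tracked; swap these out to track anything else
KEYWORDS = ["airbyte", "fivetran", "matillion", "meltano", "singer", "stitch"]

@dlt.resource(name="comments", write_disposition="append")
def hn_comments(start_ts: int = 1640995200):  # 2022-01-01 00:00:00 UTC
    for keyword in KEYWORDS:
        page = 0
        while True:
            resp = requests.get(
                "https://hn.algolia.com/api/v1/search_by_date",
                params={
                    "query": keyword,
                    "tags": "comment",
                    "numericFilters": f"created_at_i>{start_ts}",
                    "page": page,
                },
            )
            resp.raise_for_status()
            data = resp.json()
            yield from data["hits"]
            page += 1
            if page >= data["nbPages"]:
                break

pipeline = dlt.pipeline(
    pipeline_name="hacker_news",
    destination="bigquery",
    dataset_name="hacker_news_comments",
)
print(pipeline.run(hn_comments()))
```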

Using GPT-4 to summarize the comments

Now that the comments were loaded, we were ready to use GPT-4 to create a one-sentence summary for each of them. We first filtered out any irrelevant comments that may have been loaded, using simple heuristics in Python. Once we were left with only relevant comments, we called the gpt-4 API and prompted it to summarize in one line what the comment was saying about the chosen keywords. If you don't have access to GPT-4 yet, you can use the gpt-3.5-turbo API instead.
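As an illustration, here is roughly what that call can look like with today's OpenAI Python client (v1+); the prompt wording and function name are our own, and the story title and parent comments are the thread context described in the next paragraph:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def summarize_comment(comment: str, story_title: str, parents: list[str]) -> str:
    # The story title and any parent comments give the model the thread context
    context = f"Story title: {story_title}\n" + "\n".join(parents)
    response = client.chat.completions.create(
        model="gpt-4",  # or "gpt-3.5-turbo" if you lack GPT-4 access
        messages=[{
            "role": "user",
            "content": (
                "In one sentence, summarize what this Hacker News comment says "
                f"about the ELT tool it mentions.\n\nContext:\n{context}\n\n"
                f"Comment:\n{comment}"
            ),
        }],
    )
    return response.choices[0].message.content
```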

Since these comments were posted in response to stories or other comments, we fed the story title and any parent comments into the prompt as context. To avoid hitting rate-limit errors and losing all progress, we processed 100 comments at a time, saving the results to a CSV file after each batch. We then built a Streamlit app to load and display them in a dashboard. Here is what the dashboard looks like:

[Image: dashboard.png]

Deploying the pipeline, Google BigQuery, and Streamlit app

With all the comments loaded and the summaries generated in bulk, we were ready to deploy this process and have the dashboard update daily with new comments.

We decided to deploy our Streamlit app on a GCP VM. To have the app update daily with new data, we did the following:

  1. We first deployed our dlt pipeline using GitHub Actions so that new comments are loaded to BigQuery daily.
  2. We then wrote a Python script to pull new comments from BigQuery into the VM and scheduled it to run daily using crontab (see the sketch after this list).
  3. This script also calls the gpt-4 API to generate summaries, but only for the new comments.
  4. Finally, it updates the CSV file that the Streamlit app reads to render the dashboard. Check it out here!
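For illustration, here is a hedged sketch of what steps 2 to 4 can look like in a single script; the table, column, and file names are hypothetical, and the real logic lives in the repo. Scheduling it daily can be as simple as a crontab entry like `0 6 * * * python /home/user/update_dashboard.py` (path illustrative).

```python
import pandas as pd
from google.cloud import bigquery

CSV_PATH = "comment_summaries.csv"  # the file the Streamlit app reads

def update_dashboard_data() -> None:
    existing = pd.read_csv(CSV_PATH)
    last_seen = int(existing["created_at_i"].max())

    # Step 2: pull only comments newer than the ones already summarized
    client = bigquery.Client()
    new_comments = client.query(
        "SELECT comment_id, comment_text, story_title, created_at_i "
        "FROM `hacker_news_comments.comments` "
        f"WHERE created_at_i > {last_seen}"
    ).to_dataframe()

    # Step 3: generate summaries only for the new comments
    # (summarize_comment is the GPT-4 helper sketched earlier in this post)
    new_comments["summary"] = new_comments.apply(
        lambda row: summarize_comment(row["comment_text"], row["story_title"], []),
        axis=1,
    )

    # Step 4: append to the CSV that the Streamlit dashboard displays
    pd.concat([existing, new_comments]).to_csv(CSV_PATH, index=False)

if __name__ == "__main__":
    update_dashboard_data()
```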

Follow the accompanying GitHub repo to create your own Hacker News/GPT-4 dashboard.

Rahul Joshi · 3 min read

TL;DR: As of last week, there is a dlt pipeline that loads data from Google Analytics 4 (GA4). We’ve been excited about GA4 for a while now, so we decided to build some internal dashboards and show you how we did it.

Why GA4?

We set out to build an internal dashboard demo based on data from Google Analytics (GA4). Google announced that it will stop processing hits for Universal Analytics (UA) on July 1st, 2023, so many people now have to set up their analytics on top of GA4 instead of UA, and are struggling to do so because the two models differ. For example, in UA a session represents the period of time that a user is actively engaged on your site, while in GA4 a session_start event generates a session ID that is associated with all subsequent events in the session. Our hope is that this demo helps you begin this transition!

Initial explorations

We decided to make a dashboard that helps us better understand data attribution for our blog posts (e.g. As DuckDB crosses 1M downloads / month, what do its users do?). Once we got our credentials working, we used the GA4 dlt pipeline to load data into a DuckDB instance on our laptop. This allowed us to figure out which requests we needed to make to get the data showing the impact of each blog post (e.g. across different channels, what was the subsequent engagement with our docs, etc.). We found it helpful to use the GA4 Query Explorer for this.
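If you want to reproduce this locally, the load step can look roughly like the following. This is a sketch, not the verified source's exact interface: we assume the scaffolded `google_analytics` source accepts a list of report queries, with the GA4 property ID and service-account credentials supplied via `.dlt/config.toml` and `.dlt/secrets.toml`; check the repo's setup guide for the exact parameters.

```python
import dlt
# Scaffold the verified source first with: dlt init google_analytics duckdb
from google_analytics import google_analytics

# Illustrative report query; dimensions and metrics use GA4 Data API names
queries = [
    {
        "resource_name": "blog_traffic",
        "dimensions": ["sessionSource", "pagePath"],
        "metrics": ["totalUsers", "engagedSessions"],
    }
]

pipeline = dlt.pipeline(
    pipeline_name="ga4_demo",
    destination="duckdb",
    dataset_name="ga4_data",
)
print(pipeline.run(google_analytics(queries=queries)))
```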

Internal dashboard

[Images: Dashboard 1, Dashboard 2]

With the data loaded locally, we were able to build the dashboard on our system using Streamlit. You can also do this on your system by simply cloning this repo and following the steps listed here.
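A Streamlit dashboard on top of that local DuckDB file can be as small as the sketch below; the file, dataset, and column names follow the hypothetical query above (dlt snake-cases GA4's camelCase field names on load):

```python
import duckdb
import streamlit as st

# Read from the DuckDB file the pipeline produced (name is illustrative)
con = duckdb.connect("ga4_demo.duckdb", read_only=True)
df = con.execute(
    "SELECT page_path, SUM(total_users) AS users "
    "FROM ga4_data.blog_traffic "
    "GROUP BY page_path ORDER BY users DESC"
).df()

st.title("Blog post attribution")
st.bar_chart(df, x="page_path", y="users")
st.dataframe(df)
```

Run it with `streamlit run dashboard.py` and it serves the dashboard locally.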

With the pipeline and the dashboard set up just how we liked them, we were ready to deploy.

Deploying the data warehouse

We decided to deploy our Streamlit app on a Google Cloud VM instance. This means that instead of storing the data locally, it would need to live somewhere the Streamlit app could access. Hence we decided to load the data into a PostgreSQL database on the VM. See here for more details on our process.
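Conveniently, with dlt this is mostly a one-line change: point the pipeline at the postgres destination and put the VM database's credentials into `.dlt/secrets.toml`. A minimal sketch, assuming the same source as above:

```python
import dlt

# Same google_analytics source as before; only the destination changes.
# Credentials for the VM's PostgreSQL instance go into .dlt/secrets.toml
# under [destination.postgres.credentials].
pipeline = dlt.pipeline(
    pipeline_name="ga4_demo",
    destination="postgres",
    dataset_name="ga4_data",
)
# pipeline.run(google_analytics(queries=queries)) as in the earlier sketch
```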

Deploying the dlt pipeline with GitHub Actions

Once we had our data warehouse set up, we were ready to deploy the pipeline. We followed the deploy a pipeline walkthrough to configure and deploy a pipeline that loads the data into our warehouse daily.
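For reference, the dlt CLI can scaffold this workflow: a command along the lines of `dlt deploy ga4_pipeline.py github-action --schedule "0 6 * * *"` (script name and cron schedule here are illustrative) generates a GitHub Actions workflow file that runs the pipeline on a schedule; the walkthrough covers the exact steps and how to wire up secrets.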

Deploying the dashboard

We finally deployed our Streamlit app on our Google Cloud VM instance by following these steps.

Enjoy this blog post? Give dlt a ⭐ on GitHub 🤜🤛
