
3 posts tagged with "OpenAI"


Anuun Chinbat · 12 min read

It's been nearly half a century since cron was first introduced, and now we have a handful of orchestration tools that go way beyond just scheduling tasks. With data folks constantly debating which tools are top-notch and which ones should leave the scene, it feels like we're at a turning point in the evolution of these tools. By that I mean the term 'orchestrator' has become kind of a catch-all, and that's causing confusion because we're using this one word to talk about a bunch of different things.

[Image: dates]

Think about the word “date.” It can mean a fruit, a romantic outing, or a day on the calendar, right? We usually figure out which one it is from the context, but what does context mean when it comes to orchestration? It might sound like a simple question, but it's pretty important to get this straight.

And here's a funny thing: some people, after eating an odd-tasting date (the fruit, of course), are so put off that they swear off going on romantic dates altogether. It's an exaggerated comparison, but it shows how one bad experience can color our view of something completely different. That's kind of what's happening with orchestration tools: if someone had a bad time with one tool, they might be overly critical of another, even though it might be a totally different experience.

So the context for orchestration tools seems to be defined primarily by one thing: WHEN a specific tool was first introduced to the market (aside from the obvious factors, like the technical background of the person discussing these tools and their tendency to be a chronic complainer 🙄).


IT'S ALL ABOUT TIMING!

[Image: the evolution of data orchestration]

The Illegitimate Child

Cron was initially released in 1975 and is undoubtedly the father of all scheduling tools, including orchestrators, but I’m assuming cron didn’t anticipate this many offspring in the field of data (or perhaps it did). As Oracle brought the first commercial relational database to market in 1979, people started to realize that data needed to be moved on a schedule and without manual effort. And it was doable with the help of Control-M, though it was more of a general workflow automation tool that didn’t pay special attention to data workflows.

Basically, since the solutions of that era weren’t data-driven, the attitude was more “the job gets done, but without a guarantee of data quality.”

Finally Adopted

Unlike Control-M, Informatica was designed with data operations in mind from the beginning. As data started to spread across entire companies, advanced OLAP systems emerged alongside the broad adoption of data warehousing. Data no longer only needed to be moved; it needed to be integrated across many systems and users. Informatica’s data orchestration solution was inevitably influenced by the rising popularity of the contemporary drag-and-drop concept, much to the chagrin of many modern data engineers, who would recommend skipping Informatica and other GUI-based ETL tools that offer ‘visual programming’.

As the creator of Airflow, Max Beauchemin, said: “There's a multitude of reasons why complex pieces of software are not developed using drag and drop tools: it's that ultimately code is the best abstraction there is for software.”

To Be Free, That Is, Diverse

With traditional ETL tools such as IBM DataStage and Talend becoming well-established in the 1990s and early 2000s, the big data movement started gaining momentum with Hadoop as the main star. Oozie, open-sourced in 2011, was tasked with workflow scheduling of Hadoop jobs, while closed-source solutions like K2View started operating behind the curtains.

Fast forward a bit, and the scene exploded: Airflow quickly became the heavyweight champ, while every big data service out there began rolling out its own orchestrator. This burst brought diversity, but with diversity came a maze of complexity. All of a sudden, there’s an orchestrator for everyone — whether you’re chasing features or just trying to make your budget work 👀 — and picking the perfect one for your needs has gotten even trickier.

[Image: types]

The Bottom Line

The thing is that every tool out there has some inconvenient truths, and the real question isn't about escaping the headache — it's about choosing your type of headache. Hence the endless sea of “versus” articles, blog posts, and guides trying to help you pick your personal battle.

A Redditor: “Everyone has hated all orchestration tools for all time. People just hated Airflow less and it took off.”

What I'm getting at is this: we're all a bit biased by the "law of the instrument." You know, the whole “If all you have is a hammer, everything looks like a nail” thing. Most engineers probably grabbed the latest or most hyped tool when they first dipped their toes into data orchestration and have stuck with it ever since. Sure, Airflow is the belle of the ball for the community, but there's a whole lineup of contenders vying for the spotlight.

[Image: law of the instrument]

And there are obviously those who would relate to the following:

[Image: Reddit screenshot]


A HANDY DETOUR POUR TOI 💐

  • The Fundamentals
  • About Airflow
  • Miscellaneous


WHAT THE FUTURE HOLDS...

I'm no oracle or tech guru, but it's pretty obvious that at their core, most data orchestration tools are pretty similar. They're like building blocks that can be put together in different ways—some features come, some go, and users are always learning something new or dropping something old. So, what's really going to make a difference down the line is NOT just about having the coolest features. It's more about having a strong community that's all in on making the product better, a welcoming onboarding process that doesn't feel like rocket science, and finding that sweet spot between making things simple to use and letting users tweak things just the way they like.

In other words, it's not just about what the tools can do, but how people feel about using them, learning them, contributing to them, and obviously how much they spend to maintain them. That's likely where the future winners in the data orchestration game will stand out. But don’t get me wrong, features are important — it's just that there are other things equally important.


I’ve been working on this article for a WHILE now, and, honestly, it's been a bit of a headache trying to gather any solid, objective info on which data orchestration tool tops the charts. The more I think about it, the more I realise it's probably because trying to measure "the best" or "most popular" is a bit like trying to catch smoke with your bare hands — pretty subjective by nature. Plus, only testing them with non-production level data probably wasn't my brightest move.

However, I did create a fun little project where I analysed the sentiment of comments on articles about selected data orchestrators on Hacker News and gathered Google Trends data for the past year.

Just a heads-up, though: the results are BY NO MEANS reliable and are skewed due to some fun with words. For instance, searching for “Prefect” kept leading me to articles about Japanese prefectures, “Keboola” resulted in Kool-Aid content, and “Luigi”... well, let’s just say I ran into Mario’s brother more than once 😂.


THE FUN LITTLE PROJECT

Straight to the GitHub repo.

I used Dagster and dlt to load data into Snowflake, and since both of them have integrations with Snowflake, it was easy to set things up and have them all running:

[Image: pipeline overview]

This project is very minimal, including just what's needed to run Dagster locally with dlt. Here's a quick breakdown of the repo’s structure:

  1. .dlt: Utilized by the dlt library for storing configuration and sensitive information. The Dagster project is set up to fetch secret values from this directory as well.
  2. charts: Used to store chart images generated by assets.
  3. dlt_dagster_snowflake_demo: Your Dagster package, comprising Dagster assets, dlt resources, Dagster resources, and general project configurations.

Dagster Resources Explained

In the resources folder, the following two Dagster resources are defined as classes:

  1. DltPipeline: This is our dlt object defined as a Dagster ConfigurableResource that creates and runs a dlt pipeline with the specified data and table name. It will later be used in our Dagster assets to load data into Snowflake.

    import dlt
    from dagster import ConfigurableResource


    class DltPipeline(ConfigurableResource):
        # Initialize resource with pipeline details
        pipeline_name: str
        dataset_name: str
        destination: str

        def create_pipeline(self, resource_data, table_name):
            """
            Creates and runs a dlt pipeline with specified data and table name.

            Args:
                resource_data: The data to be processed by the pipeline.
                table_name: The name of the table where data will be loaded.

            Returns:
                The result of the pipeline execution.
            """

            # Configure the dlt pipeline with your destination details
            pipeline = dlt.pipeline(
                pipeline_name=self.pipeline_name,
                destination=self.destination,
                dataset_name=self.dataset_name
            )

            # Run the pipeline with your parameters
            load_info = pipeline.run(resource_data, table_name=table_name)
            return load_info
  2. LocalFileStorage: Manages the local file storage, ensuring the storage directory exists and allowing data to be written to files within it. It will be later used in our Dagster assets to save images into the charts folder.
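    For reference, here's a minimal sketch of what such a resource might look like. It's a sketch under my own assumptions about field and method names, not necessarily the repo's exact implementation:

    from pathlib import Path

    from dagster import ConfigurableResource


    class LocalFileStorage(ConfigurableResource):
        # Directory where generated chart images are written, e.g. "charts"
        dir: str

        def setup_for_execution(self, context) -> None:
            # Ensure the storage directory exists before any asset writes to it
            Path(self.dir).mkdir(parents=True, exist_ok=True)

        def write(self, file_name: str, data: bytes) -> None:
            # Write binary data (such as a rendered chart) into the storage directory
            (Path(self.dir) / file_name).write_bytes(data)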

dlt Explained

In the dlt folder within dlt_dagster_snowflake_demo, necessary dlt resources and sources are defined. Below is a visual representation illustrating the functionality of dlt:

[Image: dlt explained]

  1. hacker_news: A dlt resource that yields stories related to specified orchestration tools from Hackernews. For each tool, it retrieves the top 5 stories that have at least one comment. The stories are then appended to the existing data.

    Note that the write_disposition can also be set to merge or replace:

    • The merge write disposition merges the new data from the resource with the existing data at the destination. It requires a primary_key to be specified for the resource. More details can be found here.
    • The replace write disposition replaces the data in the destination with the data from the resource: it drops the existing tables and recreates the schema before loading the data.

    More details can be found here.

  2. comments: A dlt transformer - a resource that receives data from another resource. It fetches comments for each story yielded by the hacker_news function.

  3. hacker_news_full: A dlt source that extracts data from the source location using one or more resource components, such as hacker_news and comments. To illustrate, if the source is a database, a resource corresponds to a table within that database.

  4. google_trends: A dlt resource that fetches Google Trends data for specified orchestration tools. It attempts to retrieve the data multiple times in case of failures or empty responses. The retrieved data is then appended to the existing data.

As you may have noticed, the dlt library is designed to handle the unnesting of data internally. When you retrieve data from APIs like Hacker News or Google Trends, dlt automatically unpacks the nested structures into relational tables, creating and linking child and parent tables. This is achieved through unique identifiers (_dlt_id and _dlt_parent_id) that link child tables to specific rows in the parent table. However, it's important to note that you have control over how this unnesting is done.
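To make the resource / transformer / source pattern above more concrete, here's a simplified, self-contained sketch of how such a setup can be declared with dlt. The fetch_top_stories and fetch_comments helpers below are stand-ins I made up for illustration; the project's real code lives in the repo:

import dlt

def fetch_top_stories(tool):
    # Stand-in for the Hacker News search API call
    return [{"id": 1, "title": f"{tool} in production", "tool": tool}]

def fetch_comments(story_id):
    # Stand-in for fetching the comments of a story
    return [{"story_id": story_id, "text": "example comment"}]

@dlt.resource(write_disposition="append")
def hacker_news(tools=("airflow", "dagster", "prefect")):
    # A resource yields raw items - here, top stories per orchestration tool
    for tool in tools:
        yield from fetch_top_stories(tool)

@dlt.transformer(data_from=hacker_news)
def comments(story):
    # A transformer receives items from another resource and enriches them
    yield from fetch_comments(story["id"])

@dlt.source
def hacker_news_full():
    # A source groups one or more resources, like tables within a database
    return hacker_news, comments

Running pipeline.run(hacker_news_full()) against a destination such as Snowflake would then load the stories and their comments as separate tables.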

The Results

Alright, so once you've got your Dagster assets all materialized and data loaded into Snowflake, let's take a peek at what you might see:

[Image: sentiment counts]

I understand if you're scratching your head at first glance, but let me clear things up. Remember those sneaky issues I mentioned with Keboola and Luigi earlier? Well, I've masked their charts with the respective “culprits”.

Now, onto the bars. Each trio of bars shows the count of negative, neutral, and positive comments on Hacker News articles that have at least one comment and were returned when searching for a specific orchestration tool, grouped by that tool.

What's the big reveal? It seems like Hacker News readers tend to spread more positivity than negativity, though neutral comments hold their ground.

And, as is often the case with utilizing LLMs, this data should be taken with a grain of salt. It's more of a whimsical exploration than a rigorous analysis. However, if you take a peek behind Kool Aid and Luigi, it's intriguing to note that articles related to them seem to attract a disproportionate amount of negativity. 😂
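For the curious: the sentiment labels behind those counts can be produced with a single LLM call per comment. Below is a rough sketch of one way to do it with the OpenAI Python client; it illustrates the idea and is not necessarily the exact prompt or model the project used:

from openai import OpenAI  # assumes the openai>=1.0 client

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def classify_sentiment(comment_text: str) -> str:
    # Ask the model for a one-word label: negative, neutral, or positive
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "Classify the sentiment of the comment as negative, neutral, or positive. Reply with one word."},
            {"role": "user", "content": comment_text},
        ],
    )
    return response.choices[0].message.content.strip().lower()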


IF YOU'RE STILL HERE

… and you're just dipping your toes into the world of data orchestration, don’t sweat it. It's totally normal if it doesn't immediately click for you. For beginners, it can be tricky to grasp because in small projects, there isn't always that immediate need for things to happen "automatically" - you build your pipeline, run it once, and then bask in the satisfaction of your results - just like I did in my project. However, if you start playing around with one of these tools now, it could make it much easier to work with them later on. So, don't hesitate to dive in and experiment!

… And hey, if you're a seasoned pro about to drop some knowledge bombs, feel free to go for it - because what doesn’t challenge us, doesn’t change us 🥹. (*Cries in Gen Z*)

Anuun Chinbat · 6 min read

THE PROBLEM

There are two types of people: those who hoard thousands of unread emails in their inbox and those who open them immediately to avoid the ominous red notification. But one thing unites us all: everyone hates emails. The reasons are clear:

  • They're often unnecessarily wordy, making them time-consuming.
  • SPAM (obviously).
  • They become black holes of lost communication because CC/BCC-ing people doesn't always work.
  • Sometimes, there are just too many.

So, this post will explore a possible remedy to the whole email issue involving AI.


THE SOLUTION

Don't worry; it's nothing overly complex, but it does involve some cool tools that everyone could benefit from.

💡 In a nutshell, I created two flows (a main flow and a subflow) in Kestra:

  • The main flow extracts email data from Gmail and loads it into BigQuery using dlt, checks for new emails, and, if found, triggers the subflow for further processing.
  • The subflow utilizes OpenAI to summarize and analyze the sentiment of an email, loads the results into BigQuery, and then notifies about the details via Slack.

Just so you're aware:

  • Kestra is an open-source automation tool that makes both scheduled and event-driven workflows easy.
  • dlt is an open-source library that you can add to your Python scripts to load data from various and often messy data sources into well-structured, live datasets.
Tip: Wanna jump to the GitHub repo?


HOW IT WORKS

To lay it all out clearly: Everything's automated in Kestra, with hassle-free data loading thanks to dlt, and the analytical thinking handled by OpenAI. Here's a diagram to help you understand the general outline of the entire process.

[Image: overview]

Now, let's delve into specific parts of the implementation.

The environment:

💡 The two flows in Kestra are set up in a very straightforward and intuitive manner. Simply follow the Prerequisites and Setup guidelines in the repo. It should take no more than 15 minutes.

Once you’ve opened http://localhost:8080/ in your browser, this is what you’ll see on your screen:

[Image: Kestra UI]

Now, all you need to do is create your flows and execute them.

The great thing about Kestra is its ease of use - it's UI-based, declarative, and language-agnostic. Unless you're using a task like a Python script, you don't even need to know how to code.

Tip: If you're already considering ways to use Kestra for your projects, consult their documentation and the plugin pages for further insights.

The data loading part

💡 This is entirely managed by dlt in just five lines of code.

I set up a pipeline using the Inbox source – a regularly tested and verified source from dlt – with BigQuery as the destination.

In my scenario, the email data doesn't have nested structures, so there's no need for flattening. However, if you encounter nested structures in a different use case, dlt can automatically normalize them during loading.
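As a quick illustration of that automatic normalization, here's a standalone sketch, separate from the email pipeline, using DuckDB so it runs locally:

import dlt

# A record with a nested list, similar to what many APIs return
data = [
    {
        "message_id": 1,
        "recipients": [{"email": "a@example.com"}, {"email": "b@example.com"}],
    }
]

pipeline = dlt.pipeline(
    pipeline_name="normalization_demo",
    destination="duckdb",
    dataset_name="demo_data",
)

# dlt flattens the nested list into a child table "emails__recipients",
# linked to the parent "emails" table via _dlt_id / _dlt_parent_id
load_info = pipeline.run(data, table_name="emails")
print(load_info)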

Here's how the pipeline is defined and subsequently run in the first task of the main flow in Kestra:

import dlt
import pendulum

# The Inbox verified source module, created by `dlt init inbox bigquery`
from inbox import inbox_source

# Run dlt pipeline to load email data from gmail to BigQuery
pipeline = dlt.pipeline(
    pipeline_name="standard_inbox",
    destination='bigquery',
    dataset_name="messages_data",
    full_refresh=False,
)

# Set table name
table_name = "my_inbox"
# Get messages resource from the source
messages = inbox_source(start_date=pendulum.datetime(2023, 11, 15)).messages
# Configure the messages resource to get bodies of the emails
messages = messages(include_body=True).with_name(table_name)
# Load data to the "my_inbox" table
load_info = pipeline.run(messages)

In this setup ☝️, dlt loads all email data into the table “my_inbox”, with the email body specifically stored in the “body” column. After executing your flow in Kestra, the table in BigQuery should appear as shown below:

[Image: BigQuery my_inbox table]

Tip: This implementation doesn't handle email attachments, but if you need to analyze, for instance, invoice PDFs from your inbox, you can read about how to automate this with dlt here.

The AI part

💡 In this day and age, how can we not incorporate AI into everything? 😆

But seriously, if you're familiar with OpenAI, it's a matter of an API call to the chat completion endpoint. What simplifies it even further is Kestra’s OpenAI plugin.

In my subflow, I used it to obtain both the summary and sentiment analysis of each email body. Here's a glimpse of how it's implemented:

- id: get_summary
  type: io.kestra.plugin.openai.ChatCompletion
  apiKey: "{{ secret('OPENAI_API') }}"
  model: gpt-3.5-turbo
  prompt: "Summarize the email content in one sentence with less than 30 words: {{inputs.data[0]['body']}}"
  messages: [{"role": "system", "content": "You are a tool that summarizes emails."}]
Info: Kestra also includes Slack and BigQuery plugins, which I used in my flows. Additionally, there is a wide variety of other plugins available.

The automation part

💡 Kestra triggers are the ideal solution!

I’ve used a schedule trigger that allows you to execute your flow on a regular cadence, e.g. using a CRON expression:

triggers:
  - id: schedule
    type: io.kestra.core.models.triggers.types.Schedule
    cron: "0 9-18 * * 1-5"

This configuration ensures that your flows are executed hourly on workdays from 9 AM to 6 PM.


THE OUTCOME

A Slack assistant that delivers crisp inbox insights right at your fingertips:

[Image: Slack message]

And a well-organized table in BigQuery, ready for you to dive into a more complex analysis:

[Image: BigQuery table]

In essence, using Kestra and dlt offers a trio of advantages for refining email analysis and data workflows:

  1. Efficient automation: Kestra effortlessly orchestrates intricate workflows, integrating smoothly with tools like dlt, OpenAI, and BigQuery. This reduces manual intervention, eliminates errors, and frees up more time for you.
  2. User-friendly and versatile: Both Kestra and dlt are crafted for ease of use, accommodating a range of skill levels. Their adaptability extends to various use cases.
  3. Seamless scaling: Kestra, powered by Kafka and Elasticsearch, adeptly manages large-scale data and complex workflows. Coupled with dlt's solid data integration capabilities, it ensures a stable and reliable solution for diverse requirements.

HOW IT COULD WORK ELSEWHERE

Basically, you can apply the architecture discussed in this post whenever you need to automate a business process!

For detailed examples of how Kestra can be utilized in various business environments, you can explore Kestra's use cases.

Embrace automation, where the only limit is your imagination! 😛

Anton Burnashev · 10 min read

tl;dr: In this blog post, we'll build a RAG chatbot for Zendesk Support data using Verba and dlt.

As businesses scale and the volume of internal knowledge grows, it becomes increasingly difficult for everyone in the company to find the right information at the right time.

With the latest advancements in large language models (LLMs) and vector databases, it's now possible to build a new class of tools that can help get insights from this data. One approach to do so is Retrieval-Augmented Generation (RAG). The idea behind RAG is to retrieve relevant information from your database and use LLMs to generate a customised response to a question. Leveraging RAG enables the LLM to tailor its responses based on your proprietary data.

[Diagram: internal business knowledge retrieval-augmented generation (RAG), involving components like Salesforce, Zendesk, Asana, Jira, Notion, Slack and HubSpot, to answer user queries and generate responses]
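Conceptually, a RAG answer boils down to two steps: retrieve the most relevant snippets, then let the LLM answer using only those snippets. In this post, Weaviate and Verba handle that for us; the sketch below just illustrates the core idea, with a stand-in retrieve callable:

from openai import OpenAI  # assumes the openai>=1.0 Python client

client = OpenAI()

def answer_with_rag(question: str, retrieve) -> str:
    # `retrieve` is any callable that returns the top-k most relevant text
    # snippets for the question, e.g. a vector-database query (stand-in here)
    context = "\n\n".join(retrieve(question, k=3))
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[
            {"role": "system", "content": "Answer the question using only the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return response.choices[0].message.content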

One such source of internal knowledge is help desk software. It contains a wealth of information about the company's customers and their interactions with the support team.

In this blog post, we'll guide you through the process of building a RAG application for Zendesk Support data, a popular help desk software. We’re going to use dlt, Weaviate, Verba and OpenAI.

dlt is an open-source Python library that simplifies the process of loading data from various sources. It does not require extensive setup or maintenance and is particularly useful for CRM data, which is highly tailored to the needs of the business and changes frequently.

Weaviate is an open-source, AI-native vector database that is redefining the foundation of AI-powered applications. With capabilities for vector, hybrid, and generative search over billions of data objects, Weaviate serves as the critical infrastructure for organizations building sophisticated AI solutions and exceptional user experiences.

Verba is an open-source chatbot powered by Weaviate. It's built on top of Weaviate's state-of-the-art Generative Search technology. Verba includes a web interface and a query engine that uses the Weaviate database.

Prerequisites

  1. A URL and an API key of a Weaviate instance. We're using the hosted version of Weaviate to store our data. Head over to the Weaviate Cloud Services and create a new cluster. You can create a free sandbox, but keep in mind your cluster will expire and your data will be deleted after 14 days. In the "Details" of your cluster you'll find the Cluster URL and the API key.
  2. An OpenAI account and API key. Verba utilizes OpenAI's models to generate answers to users' questions, and Weaviate uses them to vectorize text before storing it in the database. You can sign up for an account on OpenAI's website.
  3. A Zendesk account and API credentials.

Let’s get started

Step 1. Set up Verba

Create a new folder for your project and install Verba:

mkdir verba-dlt-zendesk
cd verba-dlt-zendesk
python -m venv venv
source venv/bin/activate
pip install goldenverba

To configure Verba, we need to set the following environment variables:

VERBA_URL=https://your-cluster.weaviate.network # your Weaviate instance URL
VERBA_API_KEY=F8...i4WK # the API key of your Weaviate instance
OPENAI_API_KEY=sk-...R # your OpenAI API key

You can put them in a .env file in the root of your project or export them in your shell.

Let's test that Verba is installed correctly:

verba start

You should see the following output:

INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
ℹ Setting up client
✔ Client connected to Weaviate Cluster
INFO: Started server process [50252]
INFO: Waiting for application startup.
INFO: Application startup complete.

Now, open your browser and navigate to http://localhost:8000.

[Screenshot: Verba, a retrieval-augmented generation chatbot powered by Weaviate]

Great! Verba is up and running.

If you try to ask a question now, you'll get an error in return. That's because we haven't imported any data yet. We'll do that in the next steps.

Step 2. Install dlt with Zendesk source

We get our data from Zendesk using dlt. Let's install it along with the Weaviate extra:

pip install "dlt[weaviate]"

This also installs a handy CLI tool called dlt. It will help us initialize the Zendesk verified data source—a connector to Zendesk Support API.

Let's initialize the verified source:

dlt init zendesk weaviate

dlt init pulls the latest version of the connector from the verified source repository and creates a credentials file for it. The credentials file is called secrets.toml and it's located in the .dlt directory.

To make things easier, we'll use the email address and password authentication method for Zendesk API. Let's add our credentials to secrets.toml:

[sources.zendesk.credentials]
password = "your-password"
subdomain = "your-subdomain"
email = "your-email@example.com"

We also need to specify the URL and the API key of our Weaviate instance. Copy the credentials for the Weaviate instance you created earlier and add them to secrets.toml:

[destination.weaviate.credentials]
url = "https://your-cluster.weaviate.network"
api_key = "F8.....i4WK"

[destination.weaviate.credentials.additional_headers]
X-OpenAI-Api-Key = "sk-....."

All the components are now in place and configured. Let's set up a pipeline to import data from Zendesk.

Step 3. Set up a dlt pipeline

Open your favorite text editor and create a file called zendesk_verba.py. Add the following code to it:

import itertools

import dlt
from weaviate.util import generate_uuid5
from dlt.destinations.adapters import weaviate_adapter

from zendesk import zendesk_support


def to_verba_document(ticket):
    # The document id is the ticket id.
    # dlt will use this to generate a UUID for the document in Weaviate.
    return {
        "doc_id": ticket["id"],
        "doc_name": ticket["subject"],
        "doc_type": "Zendesk ticket",
        "doc_link": ticket["url"],
        "text": ticket["description"],
    }


def to_verba_chunk(ticket):
    # We link the chunk to the document by using the document (ticket) id.
    return {
        "chunk_id": 0,
        "doc_id": ticket["id"],
        "doc_name": ticket["subject"],
        "doc_type": "Zendesk ticket",
        "doc_uuid": generate_uuid5(ticket["id"], "Document"),
        "text": ticket["description"],
    }


def main():
    pipeline = dlt.pipeline(
        pipeline_name="zendesk_verba",
        destination="weaviate",
    )

    # Zendesk Support has data: tickets, users, groups, etc.
    zendesk_source = zendesk_support(load_all=False)

    # Here we get a dlt resource containing only the tickets
    tickets = zendesk_source.tickets

    # Split the tickets into two streams
    tickets1, tickets2 = itertools.tee(tickets, 2)

    @dlt.resource(primary_key="doc_id", write_disposition="merge")
    def document():
        # Map over the tickets and convert them to Verba documents.
        # primary_key is the field that will be used to generate
        # a UUID for the document in Weaviate.
        yield from weaviate_adapter(
            map(to_verba_document, tickets1),
            vectorize="text",
        )

    @dlt.resource(primary_key="doc_id", write_disposition="merge")
    def chunk():
        yield from weaviate_adapter(
            map(to_verba_chunk, tickets2),
            vectorize="text",
        )

    info = pipeline.run([document, chunk])

    return info


if __name__ == "__main__":
    load_info = main()
    print(load_info)

There's a lot going on here, so let's break it down.

First, in main() we create a dlt pipeline and add a Weaviate destination to it. We'll use it to store our data.

Next, we create a Zendesk Support source. It will fetch data from Zendesk Support API.

To match the data model of Zendesk Support to the internal data model of Verba, we need to convert Zendesk tickets to Verba documents and chunks. We do that by defining two functions: to_verba_document and to_verba_chunk. We also create two streams of tickets. We'll use them to create two dlt resources: document and chunk. These will populate the Document and Chunk classes in Verba. In both resources we instruct dlt which fields to vectorize using the weaviate_adapter() function.

We specify primary_key and write_disposition for both resources. primary_key is the field that will be used to generate a UUID for the document in Weaviate. write_disposition tells dlt how to handle duplicate documents. In our case, we want to merge them: if a document already exists in Weaviate, we want to update it with the new data.

Finally, we run the pipeline and print the load info.

Step 4. Load data into Verba

Let's run the pipeline:

python zendesk_verba.py

You should see the following output:

Pipeline zendesk_verba completed in 8.27 seconds
1 load package(s) were loaded to destination weaviate and into dataset None
The weaviate destination used https://your-cluster.weaviate.network location to store data
Load package 1695726495.383148 is LOADED and contains no failed jobs

Verba is now populated with data from Zendesk Support. However, there are a couple of classes that still need to be created in Verba: Cache and Suggestion. We can do that using the Verba CLI init command. When it runs, it will ask us if we want to create Verba classes. Make sure to answer "n" to the question about the Document class — we don't want to overwrite it.

Run the following command:

verba init

You should see the following output:

===================== Creating Document and Chunk class =====================
ℹ Setting up client
✔ Client connected to Weaviate Cluster
Document class already exists, do you want to overwrite it? (y/n): n
⚠ Skipped deleting Document and Chunk schema, nothing changed
ℹ Done

============================ Creating Cache class ============================
ℹ Setting up client
✔ Client connected to Weaviate Cluster
'Cache' schema created
ℹ Done

========================= Creating Suggestion class =========================
ℹ Setting up client
✔ Client connected to Weaviate Cluster
'Suggestion' schema created
ℹ Done

We're almost there! Let's start Verba:

verba start

Step 5. Ask Verba a question

Head back to http://localhost:8000 and ask Verba a question. For example, "What are common issues our users report?".

[Screenshot: Verba showing Zendesk tickets with different issues like API problems and update failures, with responses powered by Weaviate]

As you can see, Verba is able to retrieve relevant information from Zendesk Support and generate an answer to our question. It also displays the list of relevant documents for the question. You can click on them to see the full text.

Conclusion

In this blog post, we've built a RAG application for Zendesk Support using Verba and dlt. We've learned:

  • How easy it is to get started with Verba.
  • How to build a dlt pipeline with a ready-to-use data source.
  • How to customize the pipeline so it matches the data model of Verba.

Where to go next?

  • Ensure your data is up-to-date. With dlt deploy you can deploy your pipeline to Google's Cloud Composer or GitHub Actions and run it on a schedule.
  • Build a Verba RAG for other data sources. Interested in building a RAG that queries other internal business knowledge than Zendesk? With dlt you can easily switch your data source. Other dlt verified sources of internal business knowledge include Asana, Hubspot, Jira, Notion, Slack and Salesforce. However, dlt isn’t just about ready-to-use data sources; many of our users choose to implement their own custom data sources.
  • Learn more about how Weaviate works. Check out the Zero to MVP course to learn more about the Weaviate database and how to use it to build your own applications.
  • Request more features. A careful reader might have noticed that we used both Document and Chunk classes in Verba for the same type of data. For simplicity's sake, we assumed that the ticket data is small enough to fit into a single chunk. However, if you're dealing with larger documents, you might consider splitting them into chunks. Should we add chunking support to dlt? Or perhaps you have other feature suggestions? If so, please consider opening a feature request in the dlt repo to discuss your ideas!

Let's stay in touch

If you have any questions or feedback, please reach out to us on the dltHub Slack.

