RAG playground: Build your own RAG bot
- Adrian Brudaru,
Co-Founder & CDO
Workshop overview
We recently conducted a workshop on Retrieval-Augmented Generation (RAG) creation at Data Talks Club - LLM Zoomcamp. In this workshop look into the process of loading data and creating your own RAG system. We first load data and embeddings from a Notion page into LanceDB and develop a RAG Bot using Ollama. Finally we interact with the bot by asking it questions. Below, you'll find a summary of the resources, tools, and examples we discussed during the session.
Key resources
- dlt: Data loading and transformation.
- LanceDB: An efficient vector database.
- Ollama: Local LLMs for Retrieval-Augmented Generation.
- Data Talks Club (DTC): A vibrant community for data engineering resources.
Workshop content
In this workshop, we explored the fundamentals of creating a Retrieval-Augmented Generation (RAG) system. You can follow along with the detailed workshop video or access the Google Colab notebook for hands-on experience.
1. Introduction to dlt
and LanceDB:
- Loading data into LanceDB:
- Install the necessary packages:
dlt[lancedb]
andsentence-transformers
. - Load course Q&A data into LanceDB without embeddings.
- Create and execute a
dlt
pipeline to load data into LanceDB.
- Install the necessary packages:
2. Embedding data in LanceDB:
- Set up the embedding model using environment variables.
- Load and embed data into a new LanceDB table using
lancedb_adapter
.
3. Creating a Notion to LanceDB pipeline:
- Install requirements:
- Install
dlt[lancedb]
andsentence-transformers
.
- Install
- Create a
dlt
project:- Run the command
dlt init rest_api lancedb
to set up the project. - Read more about the REST API verified source here.
- Run the command
- Add API credentials:
- Obtain your Notion API key and store it in environment variables or
secrets.toml
.
- Obtain your Notion API key and store it in environment variables or
- Write the pipeline code:
- Configure the
dlt
REST API source to connect to the Notion API. - Extract relevant content from the Notion API responses.
- Load data incrementally to ensure only new or changed data is added.
- Configure the
4. Running the pipeline:
- Define and run the pipeline to load and embed data from Notion into LanceDB using
lancedb_adapter
.
5. Creating a RAG Bot with Ollama:
- Setup:
- Install and start Ollama.
- Download the desired LLM model (e.g.,
llama2-uncensored
).
- Write functions:
- Retrieve relevant content from LanceDB based on user queries.
- Create a simple RAG bot with Ollama to provide context-aware answers.
Example Questions for the RAG Bot:
- How many vacation days do I get?
- Can I get maternity leave?
To go through these steps in detail please follow the Google Collab notebook here.
If you have any questions, join our community on Slack or reach out during our next workshop session!
DTC Learners showcase
Check out the incredible projects from our DTC learners:
- LLM zoomcamp by aivisbr
- LLM zoomcamp by Martin Dornic
- llm-zoomcamp-dbeta95 by Daniel Betancur
- llm-zoomcamp by Alex
- llm-zoomcamp by AlZrSe
- LLMs-Course-DataTalksClub by Peris
- LLM-zoomcamp by Vladislav Garist
- datatalks-llm-zoomcamp by Agustín Vargas Toro
- llm_zoomcamp by Jorge V. Abrego
- llm-zoomcamp by Mahmoud kamal
- llm_zoomcamp by aturevich
- llm-zoomcamp-homework by Ramzi Hadrich
- llm-zoomcamp-2024 by SofyanAkbar94
Do you want to participate in our future workshops?
Sign up for our newsletter or keep an eye on our events page for workshop announcements.