# Weaviate
Weaviate is an open-source vector database. It allows you to store data objects and perform similarity searches over them. This destination helps you load data into Weaviate from dlt resources.
## Setup guide
- To use Weaviate as a destination, make sure dlt is installed with the `weaviate` extra:

```sh
pip install "dlt[weaviate]"
```
- Next, configure the destination in the dlt secrets file. The file is located at `~/.dlt/secrets.toml` by default. Add the following section to the secrets file:

```toml
[destination.weaviate.credentials]
url = "https://your-weaviate-url"
api_key = "your-weaviate-api-key"

[destination.weaviate.credentials.additional_headers]
X-OpenAI-Api-Key = "your-openai-api-key"
```
In this setup guide, we are using Weaviate Cloud Services to get a Weaviate instance and the OpenAI API for generating embeddings through the text2vec-openai module.
You can host your own Weaviate instance using Docker Compose, Kubernetes, or embedded. Refer to Weaviate's How-to: Install or the dlt recipe we use for our tests. In that case, you can skip the credentials part altogether:
```toml
[destination.weaviate.credentials.additional_headers]
X-OpenAI-Api-Key = "your-openai-api-key"
```
The `url` will default to http://localhost:8080 and `api_key` is not defined, which are the defaults for the Weaviate container.
- Define the source of the data. For starters, let's load some data from a simple data structure:
```py
import dlt
from dlt.destinations.adapters import weaviate_adapter

movies = [
    {
        "title": "Blade Runner",
        "year": 1982,
    },
    {
        "title": "Ghost in the Shell",
        "year": 1995,
    },
    {
        "title": "The Matrix",
        "year": 1999,
    },
]
```
- Define the pipeline:
```py
pipeline = dlt.pipeline(
    pipeline_name="movies",
    destination="weaviate",
    dataset_name="MoviesDataset",
)
```
- Run the pipeline:
```py
info = pipeline.run(
    weaviate_adapter(
        movies,
        vectorize="title",
    )
)
```
- Check the results:
```py
print(info)
```
The data is now loaded into Weaviate.
The Weaviate destination is different from other dlt destinations: to use vector search after the data has been loaded, you must specify which fields Weaviate should include in the vector index. You do that by wrapping the data (or dlt resource) with the `weaviate_adapter` function.
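Once the load completes, you can try a vector search over the vectorized field with the Weaviate client. This is a minimal sketch assuming weaviate-client v3, a placeholder URL, and that the class follows dlt's `DatasetName_TableName` naming convention (here `MoviesDataset_Movies`); adjust both to your setup:

```py
import weaviate  # assumes weaviate-client v3

client = weaviate.Client("https://your-weaviate-url")  # placeholder URL

# Semantic search over the "title" field we asked dlt to vectorize.
# The class name below is an assumption based on the default naming scheme.
response = (
    client.query.get("MoviesDataset_Movies", ["title", "year"])
    .with_near_text({"concepts": ["dystopian science fiction"]})
    .with_limit(2)
    .do()
)
print(response)
```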
## Weaviate adapter
The `weaviate_adapter` is a helper function that configures the resource for the Weaviate destination:
```py
weaviate_adapter(data, vectorize, tokenization)
```
It accepts the following arguments:
- `data`: a dlt resource object or a Python data structure (e.g., a list of dictionaries).
- `vectorize`: the name of a field or a list of field names that should be vectorized by Weaviate.
- `tokenization`: a dictionary containing the tokenization configuration for a field. The dictionary should have the structure `{'field_name': 'method'}`. Valid methods are "word", "lowercase", "whitespace", and "field". The default is "word". See Property tokenization in the Weaviate documentation for more details.

Returns: a dlt resource object that you can pass to `pipeline.run()`.
Example:
```py
weaviate_adapter(
    resource,
    vectorize=["title", "description"],
    tokenization={"title": "word", "description": "whitespace"},
)
```
When using the `weaviate_adapter`, it's important to apply it directly to resources, not to the whole source. Here's an example:
```py
import dlt
from dlt.destinations.adapters import weaviate_adapter
from dlt.sources.sql_database import sql_database

products_tables = sql_database().with_resources("products", "customers")

pipeline = dlt.pipeline(
    pipeline_name="postgres_to_weaviate_pipeline",
    destination="weaviate",
)

# Apply the adapter to the needed resources
weaviate_adapter(products_tables.products, vectorize="description")
weaviate_adapter(products_tables.customers, vectorize="bio")

info = pipeline.run(products_tables)
```
A more comprehensive pipeline would load data from some API or use one of dlt's verified sources.
## Write disposition
A write disposition defines how the data should be written to the destination. All write dispositions are supported by the Weaviate destination.
### Replace
The replace disposition replaces the data in the destination with the data from the resource. It deletes all the classes and objects and recreates the schema before loading the data.
In the movie example from the setup guide, we can use the `replace` disposition to reload the data every time we run the pipeline:
```py
info = pipeline.run(
    weaviate_adapter(
        movies,
        vectorize="title",
    ),
    write_disposition="replace",
)
```
### Merge
The merge write disposition merges the data from the resource with the data in the destination.
For the `merge` disposition, you need to specify a `primary_key` for the resource:
```py
info = pipeline.run(
    weaviate_adapter(
        movies,
        vectorize="title",
    ),
    primary_key="document_id",
    write_disposition="merge",
)
```
Internally, dlt will use the `primary_key` (`document_id` in the example above) to generate a unique identifier (UUID) for each object in Weaviate. If an object with the same UUID already exists in Weaviate, it will be updated with the new data; otherwise, a new object will be created.
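To see why this deduplicates reliably, here is a minimal illustration of deterministic UUID derivation from a primary key. It is not dlt's exact implementation, just a `uuid5`-based sketch of the property that matters: equal keys always map to the same UUID, so re-loading the same record updates it instead of duplicating it.

```py
import uuid

def object_uuid(primary_key_value: str) -> uuid.UUID:
    # Illustrative only: dlt's actual derivation may differ. uuid5 is
    # deterministic, so the same key always yields the same identifier.
    return uuid.uuid5(uuid.NAMESPACE_DNS, primary_key_value)

assert object_uuid("doc-42") == object_uuid("doc-42")  # stable across runs
```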
If you are using the `merge` write disposition, you must set it from the first run of your pipeline; otherwise, the data will be duplicated in the database on subsequent loads.
### Append
This is the default disposition. It will append the data to the existing data in the destination, ignoring the `primary_key` field.
## Data loading
Loading data into Weaviate from different sources requires a proper understanding of how data is transformed and integrated into Weaviate's schema.
### Data types
Data loaded into Weaviate from various sources might have different types. To ensure compatibility with Weaviate's schema, there's a predefined mapping between the dlt types and Weaviate's native types:
| dlt type | Weaviate type |
|---|---|
| text | text |
| double | number |
| bool | boolean |
| timestamp | date |
| date | date |
| bigint | int |
| binary | blob |
| decimal | text |
| wei | number |
| json | text |
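As a concrete illustration, consider a record that mixes several of these types. Per the table above, the `Decimal` value is stored as a Weaviate `text` property, the timestamp as `date`, the float as `number`, and the integer as `int`. The field names here are hypothetical:

```py
from datetime import datetime, timezone
from decimal import Decimal

# Hypothetical record illustrating the dlt-to-Weaviate type mapping above.
product = {
    "name": "widget",                          # text      -> text
    "price": Decimal("19.99"),                 # decimal   -> text
    "rating": 4.5,                             # double    -> number
    "stock": 120,                              # bigint    -> int
    "created_at": datetime.now(timezone.utc),  # timestamp -> date
}
```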
### Dataset name
Weaviate uses classes to categorize and identify data. To avoid potential naming conflicts, especially when dealing with multiple datasets that might have overlapping table names, dlt includes the dataset name in the Weaviate class name. This ensures a unique identifier for every class.
For example, if you have a dataset named `movies_dataset` and a table named `actors`, the Weaviate class name would be `MoviesDataset_Actors` (the default separator is an underscore).
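You can check the class names dlt created through the Weaviate schema API. A small sketch, assuming weaviate-client v3 and a locally running instance:

```py
import weaviate  # assumes weaviate-client v3

client = weaviate.Client("http://localhost:8080")

# With dataset_name="movies_dataset", a table named "actors" is expected
# to appear as the class "MoviesDataset_Actors".
print([cls["class"] for cls in client.schema.get()["classes"]])
```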
However, if you prefer to have class names without the dataset prefix, skip the `dataset_name` argument.
For example:
```py
pipeline = dlt.pipeline(
    pipeline_name="movies",
    destination="weaviate",
)
```
## Names normalization
When loading data into Weaviate, dlt tries to maintain naming conventions consistent with the Weaviate schema.
Here's a summary of the naming normalization approach: