Version: 0.5.4

Qdrant

Qdrant is an open-source, high-performance vector search engine/database. It deploys as an API service providing search for the nearest high-dimensional vectors. This destination helps you load data into Qdrant from dlt resources.

Setup Guide

  1. To use Qdrant as a destination, make sure dlt is installed with the qdrant extra:
pip install "dlt[qdrant]"
  2. Next, configure the destination in the dlt secrets file. The file is located at ~/.dlt/secrets.toml by default. Add the following section to the secrets file:
[destination.qdrant.credentials]
location = "https://your-qdrant-url"
api_key = "your-qdrant-api-key"

This setup guide uses Qdrant Cloud for a hosted Qdrant instance and the FastEmbed package, which ships with the Qdrant client library, to generate embeddings.

If no configuration options are provided, the default fallback will be http://localhost:6333 with no API key.
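Alternatively, the same credentials can be supplied as environment variables using dlt's section naming convention. A minimal sketch with placeholder values, set before the pipeline runs:

import os

# Equivalent to the [destination.qdrant.credentials] section in secrets.toml
os.environ["DESTINATION__QDRANT__CREDENTIALS__LOCATION"] = "https://your-qdrant-url"
os.environ["DESTINATION__QDRANT__CREDENTIALS__API_KEY"] = "your-qdrant-api-key"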

  3. Define the source of the data. For starters, let's load some data from a simple data structure:
import dlt
from dlt.destinations.adapters import qdrant_adapter

movies = [
    {
        "title": "Blade Runner",
        "year": 1982,
    },
    {
        "title": "Ghost in the Shell",
        "year": 1995,
    },
    {
        "title": "The Matrix",
        "year": 1999,
    },
]
  4. Define the pipeline:
pipeline = dlt.pipeline(
    pipeline_name="movies",
    destination="qdrant",
    dataset_name="MoviesDataset",
)
  5. Run the pipeline:
info = pipeline.run(
    qdrant_adapter(
        movies,
        embed="title",
    )
)
  6. Check the results:
print(info)

The data is now loaded into Qdrant.

To use vector search after the data has been loaded, you must specify which fields Qdrant needs to generate embeddings for. You do that by wrapping the data (or dlt resource) with the qdrant_adapter function.

qdrant_adapter

The qdrant_adapter is a helper function that configures the resource for the Qdrant destination:

qdrant_adapter(data, embed)

It accepts the following arguments:

  • data: a dlt resource object or a Python data structure (e.g., a list of dictionaries).
  • embed: a name of the field or a list of names to generate embeddings for.

Returns: a dlt resource object that you can pass to pipeline.run().

Example:

qdrant_adapter(
    resource,
    embed=["title", "description"],
)

When using the qdrant_adapter, it's important to apply it directly to resources, not to the whole source. Here's an example:

import dlt
from dlt.destinations.adapters import qdrant_adapter
# sql_database is dlt's built-in SQL source (import path assumes dlt >= 0.5)
from dlt.sources.sql_database import sql_database

products_tables = sql_database().with_resources("products", "customers")

pipeline = dlt.pipeline(
    pipeline_name="postgres_to_qdrant_pipeline",
    destination="qdrant",
)

# Apply the adapter to the needed resources
qdrant_adapter(products_tables.products, embed="description")
qdrant_adapter(products_tables.customers, embed="bio")

info = pipeline.run(products_tables)
tip

A more comprehensive pipeline would load data from some API or use one of dlt's verified sources.
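As an illustration, here is a hedged sketch of such a pipeline: a custom dlt resource that pulls records from an API and is wrapped with qdrant_adapter before the run. The endpoint and the title/body field names are placeholders, not part of any real source:

import dlt
import requests
from dlt.destinations.adapters import qdrant_adapter

@dlt.resource(name="articles")
def articles():
    # Placeholder endpoint; substitute a real API that returns a list of records
    response = requests.get("https://example.com/api/articles")
    response.raise_for_status()
    yield from response.json()

pipeline = dlt.pipeline(
    pipeline_name="articles_to_qdrant",
    destination="qdrant",
    dataset_name="articles_dataset",
)

# Embed the text fields that should be searchable by vector similarity
info = pipeline.run(qdrant_adapter(articles(), embed=["title", "body"]))
print(info)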

Write disposition

A write disposition defines how the data should be written to the destination. All write dispositions are supported by the Qdrant destination.

Replace

The replace disposition replaces the data in the destination with the data from the resource. It deletes all the collections and points and recreates the schema before loading the data.

In the movie example from the setup guide, we can use the replace disposition to reload the data every time we run the pipeline:

info = pipeline.run(
    qdrant_adapter(
        movies,
        embed="title",
    ),
    write_disposition="replace",
)

Merge

The merge write disposition merges the data from the resource with the data at the destination. For the merge disposition, you need to specify a primary_key for the resource:

info = pipeline.run(
    qdrant_adapter(
        movies,
        embed="title",
    ),
    primary_key="document_id",
    write_disposition="merge",
)

Internally, dlt will use primary_key (document_id in the example above) to generate a unique identifier (UUID) for each point in Qdrant. If a point with the same UUID already exists in Qdrant, it is updated with the new data; otherwise, a new point is created.
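To make the mechanism concrete, here is a minimal illustration of deriving a deterministic UUID from a primary key value. It is not dlt's exact scheme, only a sketch of why equal keys map to the same point and therefore update it instead of creating a duplicate:

import uuid

# Illustration only: the same primary key value always yields the same UUID,
# so re-loading a record addresses the existing point rather than adding a new one
def point_id(primary_key_value: str) -> str:
    return str(uuid.uuid5(uuid.NAMESPACE_DNS, primary_key_value))

assert point_id("document-1") == point_id("document-1")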

caution

If you are using the merge write disposition, you must set it from the first run of your pipeline; otherwise, the data will be duplicated in the database on subsequent loads.

Append

This is the default disposition. It will append the data to the existing data in the destination, ignoring the primary_key field.
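Spelled out, this is the same run as in the setup guide with the default disposition made explicit:

info = pipeline.run(
    qdrant_adapter(
        movies,
        embed="title",
    ),
    write_disposition="append",
)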

Dataset name

Qdrant uses collections to categorize and identify data. To avoid potential naming conflicts, especially when dealing with multiple datasets that might have overlapping table names, dlt includes the dataset name in the Qdrant collection name. This ensures a unique identifier for every collection.

For example, if you have a dataset named movies_dataset and a table named actors, the Qdrant collection name would be movies_dataset_actors (the default separator is an underscore).

However, if you prefer to have collection names without the dataset prefix, skip the dataset_name argument.

For example:

pipeline = dlt.pipeline(
    pipeline_name="movies",
    destination="qdrant",
)
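To check which collections a run created, you can list them with the Qdrant client directly. A minimal sketch, assuming the qdrant-client package is installed and the credentials match the ones configured above:

from qdrant_client import QdrantClient

client = QdrantClient(url="https://your-qdrant-url", api_key="your-qdrant-api-key")

# With dataset_name="MoviesDataset" you would see names like "MoviesDataset_movies";
# without a dataset_name, the table name is used on its own
for collection in client.get_collections().collections:
    print(collection.name)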

Additional destination options

  • embedding_batch_size: (int) The batch size for embedding operations. The default value is 32.

  • embedding_parallelism: (int) The number of concurrent threads to run embedding operations. Defaults to the number of CPU cores.

  • upload_batch_size: (int) The batch size for data uploads. The default value is 64.

  • upload_parallelism: (int) The maximum number of concurrent threads to run data uploads. The default value is 1.

  • upload_max_retries: (int) The number of retries to upload data in case of failure. The default value is 3.

  • options: (QdrantClientOptions) An instance of the QdrantClientOptions class that holds various Qdrant client options.

  • model: (str) The name of the FlagEmbedding model to use. See the list of supported models at Supported Models. The default value is "BAAI/bge-small-en".
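These options can also be set through dlt configuration. A hedged sketch using environment variables, assuming the options live at the [destination.qdrant] level as is usual for destination settings:

import os

# Tune embedding and upload batching, and pick another supported FastEmbed model
os.environ["DESTINATION__QDRANT__EMBEDDING_BATCH_SIZE"] = "64"
os.environ["DESTINATION__QDRANT__UPLOAD_BATCH_SIZE"] = "128"
os.environ["DESTINATION__QDRANT__MODEL"] = "BAAI/bge-small-en-v1.5"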

Qdrant Client Options

The QdrantClientOptions class provides options for configuring the Qdrant client.

  • port: (int) The port of the REST API interface. The default value is 6333.

  • grpc_port: (int) The port of the gRPC interface. The default value is 6334.

  • prefer_grpc: (bool) If true, the client will prefer to use the gRPC interface whenever possible in custom methods. The default value is false.

  • https: (bool) If true, the client will use the HTTPS (SSL) protocol. The default value is true if an API Key is provided, else false.

  • prefix: (str) If set, it adds the specified prefix to the REST URL path. For example, setting it to "service/v1" will result in the REST API URL as http://localhost:6333/service/v1/{qdrant-endpoint}. Not set by default.

  • timeout: (int) The timeout for REST and gRPC API requests. The default value is 5.0 seconds for REST and unlimited for gRPC.

  • host: (str) The host name of the Qdrant service. If both the URL and host are None, it is set to localhost.

  • path: (str) The persistence path for a local Qdrant instance. Not set by default.
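These options mirror the constructor arguments of the underlying qdrant-client. For reference, this is roughly how the same settings look when building a client by hand against a local instance, using the defaults listed above:

from qdrant_client import QdrantClient

# REST on 6333, gRPC on 6334, with gRPC preferred for bulk operations
client = QdrantClient(
    host="localhost",
    port=6333,
    grpc_port=6334,
    prefer_grpc=True,
    https=False,
    timeout=5,
)
print(client.get_collections())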

Run Qdrant locally

You can find the setup instructions for running Qdrant locally in the Qdrant documentation.

Syncing of dlt state

The Qdrant destination supports syncing of the dlt state.
