# LanceDB
LanceDB is an open-source, high-performance vector database. It allows you to store data objects and perform similarity searches over them. This destination helps you load data into LanceDB from dlt resources.
## Setup guide

### Choosing a model provider

First, you need to decide which embedding model provider to use. You can find all supported providers by visiting the official LanceDB docs.
Install dlt with LanceDBโ
To use LanceDB as a destination, make sure dlt
is installed with the lancedb
extra:
pip install "dlt[lancedb]"
The lancedb extra only installs dlt
and lancedb
. You will need to install your model provider's SDK.
You can find which libraries you need by also referring to the LanceDB docs.
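For example, if you choose Ollama as the provider (as in the configuration below), you would install its Python client; the `ollama` package name here assumes the official client published on PyPI:

```sh
pip install ollama
```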
### Configure the destination

Configure the destination in the dlt secrets file, located at `~/.dlt/secrets.toml` by default. Add the following section:

```toml
[destination.lancedb]
embedding_model_provider = "ollama"
embedding_model = "mxbai-embed-large"
embedding_model_provider_host = "http://localhost:11434" # Optional: custom endpoint for providers that support it

[destination.lancedb.credentials]
uri = ".lancedb"
api_key = "api_key" # API key to connect to LanceDB Cloud. Leave out if you are using LanceDB OSS.
embedding_model_provider_api_key = "embedding_model_provider_api_key" # Not needed for providers that don't require authentication (ollama, sentence-transformers).
```
- The `uri` specifies the location of your LanceDB instance. It defaults to a local, on-disk instance if not provided.
- The `api_key` is your API key for LanceDB Cloud connections. If you're using LanceDB OSS, you don't need to supply this key.
- The `embedding_model_provider` specifies the embedding provider used for generating embeddings. The default is `cohere`.
- The `embedding_model` specifies the model used by the embedding provider for generating embeddings. Check with the embedding provider which options are available; see https://lancedb.github.io/lancedb/embeddings/default_embedding_functions/ for reference.
- The `embedding_model_provider_host` specifies the full host URL, with protocol and port, for providers that support custom endpoints (such as Ollama). If not specified, the provider's default endpoint is used.
- The `embedding_model_provider_api_key` is the API key for the embedding model provider used to generate embeddings. If you're using a provider that doesn't need authentication, such as Ollama, you don't need to supply this key.

The available model providers are:

- "gemini-text"
- "bedrock-text"
- "cohere"
- "gte-text"
- "imagebind"
- "instructor"
- "open-clip"
- "openai"
- "sentence-transformers"
- "huggingface"
- "colbert"
- "ollama"
### Define your data source

For example:

```python
import dlt
from dlt.destinations.adapters import lancedb_adapter

movies = [
    {
        "id": 1,
        "title": "Blade Runner",
        "year": 1982,
    },
    {
        "id": 2,
        "title": "Ghost in the Shell",
        "year": 1995,
    },
    {
        "id": 3,
        "title": "The Matrix",
        "year": 1999,
    },
]
```
### Create a pipeline

```python
pipeline = dlt.pipeline(
    pipeline_name="movies",
    destination="lancedb",
    dataset_name="MoviesDataset",
)
```
### Run the pipeline

```python
info = pipeline.run(
    lancedb_adapter(
        movies,
        embed="title",
    )
)
```

The data is now loaded into LanceDB.
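You can verify the load by querying the table with the `lancedb` client directly. This is a minimal sketch: the exact table name is an assumption, since dlt prefixes table names with the dataset name and the dataset separator (`___` by default) and normalizes identifiers, so list the tables rather than guessing:

```python
import lancedb

# Connect to the same local, on-disk instance the pipeline wrote to.
db = lancedb.connect(".lancedb")

# List the tables dlt created instead of guessing the normalized name.
print(db.table_names())

# Assumed resulting name; adjust to what table_names() printed.
table = db.open_table("movies_dataset___movies")

# The configured embedding function also embeds the query text,
# so a plain-text similarity search works here.
results = table.search("dystopian cyberpunk").limit(2).to_list()
print(results)
```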
To use vector search after loading, you must specify which fields LanceDB should generate embeddings for. Do this by wrapping the data (or dlt resource) with the `lancedb_adapter` function.
## Using an adapter to specify columns to vectorize

Out of the box, LanceDB will act as a normal database. To use LanceDB's embedding facilities, you'll need to specify which fields you'd like to embed in your dlt resource.

The `lancedb_adapter` is a helper function that configures the resource for the LanceDB destination:

```python
lancedb_adapter(data, embed="title")
```

It accepts the following arguments:

- `data`: a dlt resource object, or a Python data structure (e.g., a list of dictionaries).
- `embed`: the name of the field, or a list of field names, to generate embeddings for.

Returns: a dlt resource object that you can pass to `pipeline.run()`.
Example:

```python
lancedb_adapter(
    resource,
    embed=["title", "description"],
)
```
When using the `lancedb_adapter`, it's important to apply it directly to resources, not to the whole source. Here's an example:

```python
from dlt.sources.sql_database import sql_database

products_tables = sql_database().with_resources("products", "customers")

pipeline = dlt.pipeline(
    pipeline_name="postgres_to_lancedb_pipeline",
    destination="lancedb",
)

# Apply the adapter to the individual resources that need embeddings.
lancedb_adapter(products_tables.products, embed="description")
lancedb_adapter(products_tables.customers, embed="bio")

info = pipeline.run(products_tables)
```
## Write disposition

All write dispositions are supported by the LanceDB destination.

### Replace

The replace disposition replaces the data in the destination with the data from the resource.

```python
info = pipeline.run(
    lancedb_adapter(
        movies,
        embed="title",
    ),
    write_disposition="replace",
)
```
### Merge

The merge write disposition merges the data from the resource with the data at the destination based on a unique identifier. The LanceDB destination supports only the upsert strategy for merge, which updates existing records and inserts new ones based on a unique identifier.

You can specify the merge disposition, primary key, and merge key either in a resource or in the adapter:

```python
from typing import Generator, List

import dlt
from dlt.common.typing import DictStrAny

@dlt.resource(
    primary_key=["doc_id", "chunk_id"],
    merge_key=["doc_id"],
    write_disposition={"disposition": "merge", "strategy": "upsert"},
)
def my_rag_docs(
    data: List[DictStrAny],
) -> Generator[List[DictStrAny], None, None]:
    yield data
```

Or:

```python
pipeline.run(
    lancedb_adapter(
        my_new_rag_docs,
        merge_key="doc_id",
    ),
    write_disposition={"disposition": "merge", "strategy": "upsert"},
    primary_key=["doc_id", "chunk_id"],
)
```
The `primary_key` uniquely identifies each record, typically comprising a document ID and a chunk ID. The `merge_key`, which cannot be compound, should correspond to the canonical `doc_id` used in vector databases and represent the document identifier in your data model. It must be the first element of the `primary_key`. This `merge_key` is crucial for document identification and orphan removal during merge operations. This structure ensures proper record identification and maintains consistency with vector database concepts.
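To make the key layout concrete, here is a hypothetical shape for chunked document records under this scheme (the field values are illustrative only):

```python
# `doc_id` is the merge_key: it identifies the parent document.
# (`doc_id`, `chunk_id`) is the primary_key: it identifies each chunk.
docs = [
    {"doc_id": 1, "chunk_id": 1, "text": "first chunk of document 1"},
    {"doc_id": 1, "chunk_id": 2, "text": "second chunk of document 1"},
    {"doc_id": 2, "chunk_id": 1, "text": "first chunk of document 2"},
]
```

If document 1 is later reloaded with only one chunk, the second chunk no longer appears in the incoming data and is, by default, cleaned up by the orphan removal described next.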
### Orphan removal

LanceDB automatically removes orphaned chunks when updating or deleting parent documents during a merge operation. To disable this feature:

```python
pipeline.run(
    lancedb_adapter(
        movies,
        embed="title",
        no_remove_orphans=True,  # Disable with the `no_remove_orphans` flag.
    ),
    write_disposition={"disposition": "merge", "strategy": "upsert"},
    primary_key=["doc_id", "chunk_id"],
)
```
Note: While it's possible to omit the `merge_key` for brevity (in which case it is assumed to be the first entry of `primary_key`), explicitly specifying both is recommended for clarity.
### Append

This is the default disposition. It will append the data to the existing data in the destination.
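As a sketch, an explicit append run mirrors the replace example above; since append is the default, the `write_disposition` argument could also simply be omitted:

```python
info = pipeline.run(
    lancedb_adapter(
        movies,
        embed="title",
    ),
    write_disposition="append",
)
```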
## Additional destination options

- `dataset_separator`: The character used to separate the dataset name from table names. Defaults to "___".
- `vector_field_name`: The name of the special field that stores vector embeddings. Defaults to "vector".
- `max_retries`: The maximum number of retries for embedding operations. Set to 0 to disable retries. Defaults to 3.
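As a sketch, these options would sit in the dlt config under the same `[destination.lancedb]` section as the embedding options shown earlier (the specific values here are illustrative assumptions):

```toml
[destination.lancedb]
dataset_separator = "__"        # separate dataset name and table names with "__"
vector_field_name = "embedding" # store embeddings in a field named "embedding"
max_retries = 5                 # retry failed embedding operations up to 5 times
```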
## dbt support

The LanceDB destination doesn't support dbt integration.

## Syncing of `dlt` state

The LanceDB destination supports syncing of the `dlt` state.
## Current limitations

### In-memory tables

Adding new fields to an existing LanceDB table requires loading the entire table data into memory as a PyArrow table. This is because PyArrow tables are immutable, so adding fields requires creating a new table with the updated schema.

For huge tables, this may impact performance and memory usage, since the full table must be loaded into memory to add the new fields. Keep this in mind when working with large datasets, and monitor memory usage when adding fields to sizable existing tables.
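As a minimal illustration of the underlying PyArrow behavior (this is plain PyArrow, not part of the dlt or LanceDB API), adding a column never mutates a table in place; it produces a new one:

```python
import pyarrow as pa

t = pa.table({"id": [1, 2, 3]})

# append_column() leaves `t` untouched and returns a brand-new table,
# which is why evolving a schema means materializing the whole table.
t2 = t.append_column("year", pa.array([1982, 1995, 1999]))

print(t.column_names)   # ['id']
print(t2.column_names)  # ['id', 'year']
```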
### Null string handling for OpenAI embeddings

The OpenAI embedding service doesn't accept empty string bodies. We deal with this by replacing empty strings with a placeholder that should be very semantically dissimilar to 99.9% of queries.

If your source column (the column being embedded) has empty values, it is important to consider the impact of this: there is a slight chance that semantic queries can match these placeholder values.

We reported this issue to LanceDB: https://github.com/lancedb/lancedb/issues/1577.
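If this matters for your data, one possible workaround (a sketch, not from the dlt docs; the resource, column name, and fallback text are all assumptions for illustration) is to substitute a non-empty fallback yourself before loading, for example with a resource's `add_map` transform:

```python
import dlt
from dlt.destinations.adapters import lancedb_adapter

@dlt.resource(name="movies")
def movies_resource():
    yield {"id": 1, "title": ""}            # would otherwise hit the placeholder path
    yield {"id": 2, "title": "The Matrix"}

res = movies_resource()
# Replace empty strings so the embedding service never receives an empty body.
res.add_map(lambda item: {**item, "title": item["title"] or "untitled"})

pipeline = dlt.pipeline(pipeline_name="movies", destination="lancedb")
info = pipeline.run(lancedb_adapter(res, embed="title"))
```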