
Lance

Lance is an open-source columnar data format designed for AI/ML workloads, with native support for versioning, zero-copy access, and fast vector search. The lance destination lets you load data into Lance datasets stored on local disk or cloud object storage (S3, Azure, GCS).

Optionally, the destination can generate vector embeddings using the LanceDB embedding functions library.

Destination capabilities

The following table shows the capabilities of the Lance destination:

| Feature | Value | More |
| --- | --- | --- |
| Preferred loader file format | parquet | File formats |
| Supported loader file formats | parquet, reference | File formats |
| Supported merge strategies | upsert | Merge strategy |
| Supported replace strategies | truncate-and-insert | Replace strategy |
| Supports tz aware datetime | True | Timestamps and Timezones |
| Supports naive datetime | True | Timestamps and Timezones |


Lance vs. LanceDB destination

dlt ships two Lance-related destinations:

  • lance (this page) - stores data on local disk or cloud object storage (S3, GCS, Azure). Uses the lance library for table management and, optionally, the lancedb library for embedding generation.
  • lancedb (docs) - stores data locally or on LanceDB Cloud. Uses the lancedb library exclusively for all operations.

The lancedb destination will be phased out in favor of lance.

Setup guide

Install dlt with lance dependencies

pip install "dlt[lance]"

Quick start

import dlt

movies = [
    {"id": 1, "title": "Blade Runner", "year": 1982},
    {"id": 2, "title": "Ghost in the Shell", "year": 1995},
    {"id": 3, "title": "The Matrix", "year": 1999},
]

pipeline = dlt.pipeline(
    pipeline_name="movies",
    destination="lance",
    dataset_name="movies_db",
)

info = pipeline.run(movies, table_name="movies")

To add vector embeddings, wrap your data with lance_adapter; see Embeddings below.

Storage configuration

Configure storage in ~/.dlt/config.toml (or secrets/environment variables). The bucket_url determines the storage backend.

Local storage (default)

If bucket_url is not configured, the current working directory is used.

[destination.lance.storage]
bucket_url = "/my/dir"

Cloud storage

Cloud credentials use the same configuration fields as the filesystem destination, just under destination.lance.storage instead of destination.filesystem. Under the hood, credentials are passed to the object_store Rust crate (not fsspec), so some filesystem-specific options are not supported.

Amazon S3

[destination.lance.storage]
bucket_url = "s3://my-bucket"

[destination.lance.storage.credentials]
aws_access_key_id = "AKIA..."
aws_secret_access_key = "..."
region_name = "us-east-1"
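As with other dlt destinations, the same values can also be supplied via environment variables instead of TOML; dlt's standard convention maps nested config sections to uppercase names joined by double underscores. A sketch for the S3 case (values are placeholders):

```shell
export DESTINATION__LANCE__STORAGE__BUCKET_URL="s3://my-bucket"
export DESTINATION__LANCE__STORAGE__CREDENTIALS__AWS_ACCESS_KEY_ID="AKIA..."
export DESTINATION__LANCE__STORAGE__CREDENTIALS__AWS_SECRET_ACCESS_KEY="..."
export DESTINATION__LANCE__STORAGE__CREDENTIALS__REGION_NAME="us-east-1"
```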

Google Cloud Storage

[destination.lance.storage]
bucket_url = "gs://my-bucket"

[destination.lance.storage.credentials]
project_id = "my-project"
client_email = "...@...iam.gserviceaccount.com"
private_key = "-----BEGIN RSA PRIVATE KEY-----\n..."

Azure Blob Storage

[destination.lance.storage]
bucket_url = "az://my-container"

[destination.lance.storage.credentials]
azure_storage_account_name = "myaccount"
azure_storage_account_key = "..."

Additional storage options

You can pass storage-specific options via the options dict. These are forwarded to the object_store Rust crate. See the Lance Object Store Configuration docs for all available options.

For cloud storage, the following defaults are set automatically to prevent connection hangs:

| Option | Default | Description |
| --- | --- | --- |
| connect_timeout | 30s | TCP connection timeout |
| timeout | 120s | Overall request timeout |

You can override these or add additional options:

[destination.lance.storage]
bucket_url = "s3://my-bucket"

[destination.lance.storage.options]
allow_http = "true"
timeout = "300s"

Catalog and storage

The lance destination uses the Lance Directory Namespace (V2 Catalog Spec) to organize tables. Two concepts are configured separately:

  • Storage - where table data files are written. Configured under [destination.lance.storage].
  • Catalog - a __manifest table that tracks namespaces and tables. By default the catalog is colocated with storage (the __manifest lives under storage.bucket_url/storage.namespace_name). For advanced setups you can point the catalog at a separate location via [destination.lance.credentials]; see Advanced: separate catalog location.

The logical layout of the default (colocated) case is:

bucket_url/
└── namespace_name/                      ← root namespace directory (default: "dlt_lance_root")
    ├── __manifest/                      ← catalog tracking namespaces and tables
    ├── <hash>_<dataset>$movies/         ← lance table data
    ├── <hash>_<dataset>$_dlt_version/
    └── ...

  • Root namespace - a physical directory at bucket_url/namespace_name. The namespace_name defaults to "dlt_lance_root" and can be set to "" to use bucket_url directly.
  • Dataset namespace - a logical child namespace named after dataset_name, tracked in the __manifest/ catalog. Created automatically when the pipeline runs. All tables for the dataset are registered inside it.
  • Tables - stored as hash-prefixed directories at the root namespace level, not nested under a dataset subdirectory.

[destination.lance.storage]
bucket_url = "s3://my-bucket"
namespace_name = "production"  # root namespace subdirectory
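The colocated layout above comes down to simple path composition. A minimal illustrative sketch (not dlt's actual code; the table-directory hash prefix is generated internally):

```python
from posixpath import join


def manifest_path(bucket_url: str, namespace_name: str = "dlt_lance_root") -> str:
    """Where the __manifest catalog lives when colocated with storage."""
    # An empty namespace_name means the catalog sits directly under bucket_url.
    root = join(bucket_url, namespace_name) if namespace_name else bucket_url
    return join(root, "__manifest")


print(manifest_path("s3://my-bucket"))                # s3://my-bucket/dlt_lance_root/__manifest
print(manifest_path("s3://my-bucket", "production"))  # s3://my-bucket/production/__manifest
print(manifest_path("s3://my-bucket", ""))            # s3://my-bucket/__manifest
```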

Catalog capabilities

Two capability flags control how the directory catalog tracks tables and namespaces. The defaults work for almost everyone:

[destination.lance.capabilities]
manifest_enabled = true
dir_listing_enabled = true

  • manifest_enabled (default true) - enables the V2 catalog: a single __manifest Lance table at the root that tracks every namespace and table. Enables fast listing, nested namespaces, and multi-level table ids (which dlt uses to place tables under their dataset namespace). Recommended for single-writer or low-concurrency scenarios.
  • dir_listing_enabled (default true) - enables the V1 fallback that discovers tables by scanning directories for .lance suffixes. Safe to leave on.

When to disable manifest_enabled: if many writers hit the same catalog root concurrently (for example, multiple pipelines or parallel jobs sharing one bucket_url/namespace_name), conflicting commits to the shared __manifest table on S3/GCS can cause contention and retries. Disabling the manifest eliminates the shared write point at the cost of slower listing and no nested-namespace support. If you disable it, give each pipeline run its own namespace_name to isolate datasets.
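A multi-writer setup might therefore look like this (the namespace value is illustrative; pick a distinct one per pipeline):

```toml
[destination.lance.capabilities]
manifest_enabled = false      # no shared __manifest, so no write contention
dir_listing_enabled = true    # tables are discovered by directory scan instead

[destination.lance.storage]
bucket_url = "s3://my-bucket"
namespace_name = "pipeline_a" # unique per pipeline to isolate datasets
```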

Branching

Lance datasets support branches: lightweight version pointers for isolated reads and writes. Configure a branch name to direct all pipeline operations to that branch:

[destination.lance]
branch_name = "staging"

Or in Python:

import dlt

pipeline = dlt.pipeline(
    destination=dlt.destinations.lance(branch_name="staging"),
    dataset_name="my_data",
)

When branch_name is not set, the default main branch is used. Branches are created automatically on first write if they don't exist.

Branching is dataset-wide: all tables, including dlt system tables (_dlt_version, _dlt_loads, _dlt_pipeline_state), are read from and written to the configured branch. This means each branch maintains its own pipeline state, schema history, and load metadata, providing full isolation between branches. Schemas can evolve independently in different branches.

Advanced: separate catalog location

By default the catalog __manifest lives under storage.bucket_url. You can put it in a completely different location, for example on fast local storage while data stays on cheap object storage, or in a shared bucket while each team writes data to its own bucket. Populate [destination.lance.credentials] with its own bucket/credentials/options:

[destination.lance.storage]
bucket_url = "s3://data-bucket"

[destination.lance.credentials]
bucket_url = "s3://catalog-bucket/production"

[destination.lance.credentials.credentials]
aws_access_key_id = "AKIA..."
aws_secret_access_key = "..."
region_name = "us-east-1"

Any field left empty under credentials falls back to the corresponding storage value, so you only specify what actually differs. When credentials is omitted entirely, the catalog colocates with storage (the common case).
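The fallback rule can be pictured as a simple dict merge: catalog fields that are set win, and everything else comes from storage. A minimal illustrative sketch (not dlt's actual resolution code):

```python
def resolve_catalog_config(storage: dict, catalog: dict) -> dict:
    """Fields set under [destination.lance.credentials] take precedence;
    empty/missing fields fall back to the [destination.lance.storage] value."""
    return {**storage, **{k: v for k, v in catalog.items() if v}}


storage = {"bucket_url": "s3://data-bucket", "region_name": "us-east-1"}
catalog = {"bucket_url": "s3://catalog-bucket/production", "region_name": None}

print(resolve_catalog_config(storage, catalog))
# {'bucket_url': 's3://catalog-bucket/production', 'region_name': 'us-east-1'}
```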

Write dispositions

All write dispositions are supported.

Append

The default. Inserts all records without updating or deleting existing data.

Replace

Replaces all data in the table using a truncate-and-insert strategy:

info = pipeline.run(movies, table_name="movies", write_disposition="replace")

Merge (upsert)

Updates existing records and inserts new ones based on a unique identifier. Use lance_adapter to specify the merge_key:

from dlt.destinations.adapters import lance_adapter

pipeline.run(
    lance_adapter(data, merge_key="doc_id"),
    write_disposition={"disposition": "merge", "strategy": "upsert"},
    primary_key=["doc_id", "chunk_id"],
)

The merge_key identifies the parent document. If merge_key is not specified, the first element of primary_key is used as a fallback. When orphan removal is enabled (the default), only a single merge key is supported because the orphan deletion filter operates on a single column. To use compound merge keys, disable orphan removal with remove_orphans=False.

Orphan removal

By default, when parent documents are updated or deleted during a merge, orphaned child records (chunks that no longer have a matching parent) are automatically removed. To disable this:

lance_adapter(data, merge_key="doc_id", remove_orphans=False)
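The merge semantics above can be illustrated with plain Python. This sketch is a conceptual model of upsert with orphan removal on a chunked table, not the destination's implementation:

```python
def upsert_with_orphan_removal(table: list[dict], incoming: list[dict], merge_key: str) -> list[dict]:
    """Delete every existing row whose merge_key appears in the incoming batch
    (removing orphaned chunks), then insert the incoming rows."""
    touched = {row[merge_key] for row in incoming}
    kept = [row for row in table if row[merge_key] not in touched]
    return kept + incoming


table = [
    {"doc_id": "a", "chunk_id": 0, "text": "old chunk 0"},
    {"doc_id": "a", "chunk_id": 1, "text": "old chunk 1"},  # will become an orphan
    {"doc_id": "b", "chunk_id": 0, "text": "untouched"},
]
# doc "a" was re-chunked into a single chunk
incoming = [{"doc_id": "a", "chunk_id": 0, "text": "new chunk 0"}]

result = upsert_with_orphan_removal(table, incoming, merge_key="doc_id")
# doc "a" now has only the new chunk, its old chunk 1 is gone, and doc "b" is untouched
```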

Embeddings configuration

To generate vector embeddings automatically, configure an embedding provider. The embedding generation is powered by the LanceDB embedding functions library.

[destination.lance.embeddings]
provider = "openai"
name = "text-embedding-3-small"
vector_column = "vector"
max_retries = 3

[destination.lance.embeddings.credentials]
api_key = "sk-..."

Any additional provider-specific arguments can be passed via kwargs:

[destination.lance.embeddings.kwargs]
api_base = "https://my-proxy.example.com/v1"

Then use lance_adapter to specify which columns should be embedded. The destination automatically adds a column named after vector_column (default: "vector") to store the generated embeddings:

from dlt.destinations.adapters import lance_adapter

info = pipeline.run(
    lance_adapter(movies, embed=["title", "description"]),
    table_name="movies",
)

Access loaded data

Standard dataset access

You can query loaded data using dlt's dataset access interface, which works the same way as with any other destination:

dataset = pipeline.dataset()
df = dataset["movies"].df()

Low-level Lance access

For operations specific to the Lance format, such as version management, tagging, or direct reads, use open_lance_dataset on the destination client. It returns a lance.LanceDataset from the lance library:

with pipeline.destination_client() as client:
    ds = client.open_lance_dataset("movies")  # type: ignore[attr-defined]
    ds.create_tag("v1.0")
    print(ds.tags())

You can also check out a specific branch or version:

with pipeline.destination_client() as client:
    ds = client.open_lance_dataset("movies", branch_name="staging", version_number=5)  # type: ignore[attr-defined]

For vector similarity search and other LanceDB-specific features, use open_lancedb_table. It returns a lancedb.table.LanceTable from the lancedb library:

with pipeline.destination_client() as client:
    tbl = client.open_lancedb_table("movies")  # type: ignore[attr-defined]
    results = tbl.search("sci-fi classic").limit(5).to_list()

dbt support

The Lance destination does not support dbt integration.

Syncing of dlt state

The Lance destination supports syncing of the dlt state.
