
Lance

Lance is an open-source columnar data format designed for AI/ML workloads, with native support for versioning, zero-copy access, and fast vector search. The lance destination lets you load data into Lance datasets stored on local disk or cloud object storage (S3, Azure, GCS).

Optionally, the destination can generate vector embeddings using the LanceDB embedding functions library.

Destination capabilities

The following table shows the capabilities of the Lance destination:

| Feature | Value | More |
| --- | --- | --- |
| Preferred loader file format | parquet | File formats |
| Supported loader file formats | parquet, reference | File formats |
| Supported merge strategies | upsert | Merge strategy |
| Supported replace strategies | truncate-and-insert | Replace strategy |
| Supports tz aware datetime | True | Timestamps and Timezones |
| Supports naive datetime | True | Timestamps and Timezones |


Lance vs. LanceDB destination

dlt ships two Lance-related destinations:

  • lance (this page) - stores data on local disk or cloud object storage (S3, GCS, Azure). Uses the lance library for table management and, optionally, the lancedb library for embedding generation.
  • lancedb (docs) - stores data locally or on LanceDB Cloud. Uses the lancedb library exclusively for all operations.

The lancedb destination will be phased out in favor of lance.

Setup guide

Install dlt with lance dependencies

pip install "dlt[lance]"

Quick start

import dlt

movies = [
    {"id": 1, "title": "Blade Runner", "year": 1982},
    {"id": 2, "title": "Ghost in the Shell", "year": 1995},
    {"id": 3, "title": "The Matrix", "year": 1999},
]

pipeline = dlt.pipeline(
    pipeline_name="movies",
    destination="lance",
    dataset_name="movies_db",
)

info = pipeline.run(movies, table_name="movies")

To add vector embeddings, wrap your data with lance_adapter; see Embeddings below.

Storage configuration

Configure storage in ~/.dlt/config.toml (or secrets/environment variables). The bucket_url determines the storage backend.

Local storage (default)

If bucket_url is not configured, the current working directory is used.

[destination.lance.storage]
bucket_url = "/my/dir"

Cloud storage

Cloud credentials use the same configuration fields as the filesystem destination, just under destination.lance.storage instead of destination.filesystem. Under the hood, credentials are passed to the object_store Rust crate (not fsspec), so some filesystem-specific options are not supported.

Amazon S3

[destination.lance.storage]
bucket_url = "s3://my-bucket"

[destination.lance.storage.credentials]
aws_access_key_id = "AKIA..."
aws_secret_access_key = "..."
region_name = "us-east-1"
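As with other dlt destinations, the same values can also be supplied via environment variables instead of TOML; dlt's standard convention maps nested config sections to uppercase names joined by double underscores. A sketch for the S3 case (values are placeholders):

```shell
export DESTINATION__LANCE__STORAGE__BUCKET_URL="s3://my-bucket"
export DESTINATION__LANCE__STORAGE__CREDENTIALS__AWS_ACCESS_KEY_ID="AKIA..."
export DESTINATION__LANCE__STORAGE__CREDENTIALS__AWS_SECRET_ACCESS_KEY="..."
export DESTINATION__LANCE__STORAGE__CREDENTIALS__REGION_NAME="us-east-1"
```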

Google Cloud Storage

[destination.lance.storage]
bucket_url = "gs://my-bucket"

[destination.lance.storage.credentials]
project_id = "my-project"
client_email = "...@...iam.gserviceaccount.com"
private_key = "-----BEGIN RSA PRIVATE KEY-----\n..."

Azure Blob Storage

[destination.lance.storage]
bucket_url = "az://my-container"

[destination.lance.storage.credentials]
azure_storage_account_name = "myaccount"
azure_storage_account_key = "..."

Additional storage options

You can pass storage-specific options via the options dict. These are forwarded to the object_store Rust crate. See the Lance Object Store Configuration docs for all available options.

For cloud storage, the following defaults are set automatically to prevent connection hangs:

| Option | Default | Description |
| --- | --- | --- |
| connect_timeout | 30s | TCP connection timeout |
| timeout | 120s | Overall request timeout |

You can override these or add additional options:

[destination.lance.storage]
bucket_url = "s3://my-bucket"

[destination.lance.storage.options]
allow_http = "true"
timeout = "300s"

Catalog and storage

The lance destination uses the Lance Directory Namespace (V2 Catalog Spec) to organize tables. Two concepts are configured separately:

  • Storage - where table data files are written. Configured under [destination.lance.storage].
  • Catalog - a __manifest table that tracks namespaces and tables. By default the catalog is colocated with storage (the __manifest lives under storage.bucket_url/storage.namespace_name). For advanced setups you can point the catalog at a separate location via [destination.lance.credentials]; see Advanced: separate catalog location.

The logical layout of the default (colocated) case is:

bucket_url/
└── namespace_name/                      ← root namespace directory (default: "dlt_lance_root")
    ├── __manifest/                      ← catalog tracking namespaces and tables
    ├── <hash>_<dataset>$movies/         ← lance table data
    ├── <hash>_<dataset>$_dlt_version/
    └── ...

  • Root namespace - a physical directory at bucket_url/namespace_name. The namespace_name defaults to "dlt_lance_root" and can be set to "" to use bucket_url directly.
  • Dataset namespace - a logical child namespace named after dataset_name, tracked in the __manifest/ catalog. Created automatically when the pipeline runs. All tables for the dataset are registered inside it.
  • Tables - stored as hash-prefixed directories at the root namespace level, not nested under a dataset subdirectory.

[destination.lance.storage]
bucket_url = "s3://my-bucket"
namespace_name = "production"  # root namespace subdirectory
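The colocated layout above comes down to simple path composition. A minimal illustrative sketch (not dlt's actual code; the table-directory hash prefix is generated internally):

```python
from posixpath import join


def manifest_path(bucket_url: str, namespace_name: str = "dlt_lance_root") -> str:
    """Where the __manifest catalog lives when colocated with storage."""
    # An empty namespace_name means the catalog sits directly under bucket_url.
    root = join(bucket_url, namespace_name) if namespace_name else bucket_url
    return join(root, "__manifest")


print(manifest_path("s3://my-bucket"))                # s3://my-bucket/dlt_lance_root/__manifest
print(manifest_path("s3://my-bucket", "production"))  # s3://my-bucket/production/__manifest
print(manifest_path("s3://my-bucket", ""))            # s3://my-bucket/__manifest
```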

Catalog capabilities

Two capability flags control how the directory catalog tracks tables and namespaces. The defaults work for almost everyone:

[destination.lance.capabilities]
manifest_enabled = true
dir_listing_enabled = true

  • manifest_enabled (default true) - enables the V2 catalog: a single __manifest Lance table at the root that tracks every namespace and table. Enables fast listing, nested namespaces, and multi-level table ids (which dlt uses to place tables under their dataset namespace). Recommended for single-writer or low-concurrency scenarios.
  • dir_listing_enabled (default true) - enables the V1 fallback that discovers tables by scanning directories for .lance suffixes. Safe to leave on.

When to disable manifest_enabled: if many writers hit the same catalog root concurrently (for example, multiple pipelines or parallel jobs sharing one bucket_url/namespace_name), conflicting commits to the shared __manifest table on S3/GCS can cause contention and retries. Disabling the manifest eliminates the shared write point at the cost of slower listing and no nested-namespace support. If you disable it, give each pipeline run its own namespace_name to isolate datasets.
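A multi-writer setup might therefore look like this (the namespace value is illustrative; pick a distinct one per pipeline):

```toml
[destination.lance.capabilities]
manifest_enabled = false      # no shared __manifest, so no write contention
dir_listing_enabled = true    # tables are discovered by directory scan instead

[destination.lance.storage]
bucket_url = "s3://my-bucket"
namespace_name = "pipeline_a" # unique per pipeline to isolate datasets
```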

Branching

Lance datasets support branches: lightweight version pointers for isolated reads and writes. Configure a branch name to direct all pipeline operations to that branch:

[destination.lance]
branch_name = "staging"

Or in Python:

import dlt

pipeline = dlt.pipeline(
    destination=dlt.destinations.lance(branch_name="staging"),
    dataset_name="my_data",
)

When branch_name is not set, the default main branch is used. Branches are created automatically on first write if they don't exist.

Branching is dataset-wide: all tables, including dlt system tables (_dlt_version, _dlt_loads, _dlt_pipeline_state), are read from and written to the configured branch. This means each branch maintains its own pipeline state, schema history, and load metadata, providing full isolation between branches. Schemas can evolve independently in different branches.

Advanced: separate catalog location

By default the catalog __manifest lives under storage.bucket_url. You can put it in a completely different location, for example on fast local storage while data stays on cheap object storage, or in a shared bucket while each team writes data to its own bucket. Populate [destination.lance.credentials] with its own bucket/credentials/options:

[destination.lance.storage]
bucket_url = "s3://data-bucket"

[destination.lance.credentials]
bucket_url = "s3://catalog-bucket/production"

[destination.lance.credentials.credentials]
aws_access_key_id = "AKIA..."
aws_secret_access_key = "..."
region_name = "us-east-1"

Any field left empty under credentials falls back to the corresponding storage value, so you only specify what actually differs. When credentials is omitted entirely, the catalog colocates with storage (the common case).
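The fallback rule can be pictured as a simple dict merge: catalog fields that are set win, and everything else comes from storage. A minimal illustrative sketch (not dlt's actual resolution code):

```python
def resolve_catalog_config(storage: dict, catalog: dict) -> dict:
    """Fields set under [destination.lance.credentials] take precedence;
    empty/missing fields fall back to the [destination.lance.storage] value."""
    return {**storage, **{k: v for k, v in catalog.items() if v}}


storage = {"bucket_url": "s3://data-bucket", "region_name": "us-east-1"}
catalog = {"bucket_url": "s3://catalog-bucket/production", "region_name": None}

print(resolve_catalog_config(storage, catalog))
# {'bucket_url': 's3://catalog-bucket/production', 'region_name': 'us-east-1'}
```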

Write dispositions

All write dispositions are supported.

Append

The default. Inserts all records without updating or deleting existing data.

Replace

Replaces all data in the table using a truncate-and-insert strategy:

info = pipeline.run(movies, table_name="movies", write_disposition="replace")

Merge (upsert)

Updates existing records and inserts new ones based on a unique identifier. Use lance_adapter to specify the merge_key:

from dlt.destinations.adapters import lance_adapter

pipeline.run(
    lance_adapter(data, merge_key="doc_id"),
    write_disposition={"disposition": "merge", "strategy": "upsert"},
    primary_key=["doc_id", "chunk_id"],
)

The merge_key identifies the parent document. If merge_key is not specified, the first element of primary_key is used as a fallback. When orphan removal is enabled (the default), only a single merge key is supported because the orphan deletion filter operates on a single column. To use compound merge keys, disable orphan removal with remove_orphans=False.

Orphan removal

By default, when parent documents are updated or deleted during a merge, orphaned child records (chunks that no longer have a matching parent) are automatically removed. To disable this:

lance_adapter(data, merge_key="doc_id", remove_orphans=False)
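The merge semantics above can be illustrated with plain Python. This sketch is a conceptual model of upsert with orphan removal on a chunked table, not the destination's implementation:

```python
def upsert_with_orphan_removal(table: list[dict], incoming: list[dict], merge_key: str) -> list[dict]:
    """Delete every existing row whose merge_key appears in the incoming batch
    (removing orphaned chunks), then insert the incoming rows."""
    touched = {row[merge_key] for row in incoming}
    kept = [row for row in table if row[merge_key] not in touched]
    return kept + incoming


table = [
    {"doc_id": "a", "chunk_id": 0, "text": "old chunk 0"},
    {"doc_id": "a", "chunk_id": 1, "text": "old chunk 1"},  # will become an orphan
    {"doc_id": "b", "chunk_id": 0, "text": "untouched"},
]
# doc "a" was re-chunked into a single chunk
incoming = [{"doc_id": "a", "chunk_id": 0, "text": "new chunk 0"}]

result = upsert_with_orphan_removal(table, incoming, merge_key="doc_id")
# doc "a" now has only the new chunk, its old chunk 1 is gone, and doc "b" is untouched
```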

Embeddings configuration

To generate vector embeddings automatically, configure an embedding provider. The embedding generation is powered by the LanceDB embedding functions library.

[destination.lance.embeddings]
provider = "openai"
name = "text-embedding-3-small"
vector_column = "vector"
max_retries = 3

[destination.lance.embeddings.credentials]
api_key = "sk-..."

Any additional provider-specific arguments can be passed via kwargs:

[destination.lance.embeddings.kwargs]
api_base = "https://my-proxy.example.com/v1"

Then use lance_adapter to specify which columns should be embedded. The destination automatically adds a column named after vector_column (default: "vector") to store the generated embeddings:

from dlt.destinations.adapters import lance_adapter

info = pipeline.run(
    lance_adapter(movies, embed=["title", "description"]),
    table_name="movies",
)

Access loaded data

Standard dataset access

You can query loaded data using dlt's dataset access interface, which works the same way as with any other destination:

dataset = pipeline.dataset()
df = dataset["movies"].df()

Low-level Lance access

For operations specific to the Lance format, such as version management, tagging, or direct reads, use open_lance_dataset on the destination client. It returns a lance.LanceDataset from the lance library:

with pipeline.destination_client() as client:
    ds = client.open_lance_dataset("movies")  # type: ignore[attr-defined]
    ds.create_tag("v1.0")
    print(ds.tags())

You can also check out a specific branch or version:

with pipeline.destination_client() as client:
    ds = client.open_lance_dataset("movies", branch_name="staging", version_number=5)  # type: ignore[attr-defined]

For vector similarity search and other LanceDB-specific features, use open_lancedb_table. It returns a lancedb.table.LanceTable from the lancedb library:

with pipeline.destination_client() as client:
    tbl = client.open_lancedb_table("movies")  # type: ignore[attr-defined]
    results = tbl.search("sci-fi classic").limit(5).to_list()

dbt support

The Lance destination does not support dbt integration.

Syncing of dlt state

The Lance destination supports syncing of the dlt state.
