Version: devel

Databricks

Big thanks to Evan Phillips and swishbi.com for contributing code, time, and a test environment.

This Databricks destination will load your data into Databricks Delta tables using one of the supported cloud storage options. You can access your data using Unity Catalog.

There are two options to run dlt pipelines and load data:

Run dlt pipelines in any environment by providing credentials for both Databricks and your cloud storage.
Run dlt pipelines directly within Databricks notebooks without explicitly providing credentials.

note

If you'd like to load data to Databricks Managed Iceberg tables, use dlt+ Iceberg destination

Install dlt with Databricks

To install the dlt library with Databricks dependencies:

pip install "dlt[databricks]"

Set up your Databricks workspace

To use the Databricks destination, you need:

A Databricks workspace with a Unity Catalog metastore connected
A Gen 2 Azure storage account and container

If you already have your Databricks workspace set up, you can skip to the Loader setup guide.

1. Create a Databricks workspace in Azure

Create a Databricks workspace in Azure

In your Azure Portal, search for Databricks and create a new workspace. In the "Pricing Tier" section, select "Premium" to be able to use the Unity Catalog.
Create an ADLS Gen 2 storage account

Search for "Storage accounts" in the Azure Portal and create a new storage account. Make sure it's a Data Lake Storage Gen 2 account by enabling "hierarchical namespace" when creating the account. Refer to the Azure documentation for further information.
Create a container in the storage account

In the storage account, create a new container. This will be used as a datastore for your Databricks catalog.
Create an Access Connector for Azure Databricks

This will allow Databricks to access your storage account. In the Azure Portal, search for "Access Connector for Azure Databricks" and create a new connector.
Grant access to your storage container

Navigate to the storage container you created earlier and select "Access control (IAM)" in the left-hand menu.

Add a new role assignment and select "Storage Blob Data Contributor" as the role. Under "Members" select "Managed Identity" and add the Databricks Access Connector you created in the previous step.

2. Set up a metastore and Unity Catalog

Now go to your Databricks workspace

To get there from the Azure Portal, search for "Databricks", select your Databricks, and click "Launch Workspace".
In the top right corner, click on your email address and go to "Manage Account"
Go to "Data" and click on "Create Metastore"

Name your metastore and select a region. If you'd like to set up a storage container for the whole metastore, you can add your ADLS URL and Access Connector Id here. You can also do this on a granular level when creating the catalog.

In the next step, assign your metastore to your workspace.
Go back to your workspace and click on "Catalog" in the left-hand menu
Click "+ Add" and select "Add Storage Credential"

Create a name and paste in the resource ID of the Databricks Access Connector from the Azure portal. It will look something like this: /subscriptions/<subscription_id>/resourceGroups/<resource_group>/providers/Microsoft.Databricks/accessConnectors/<connector_name>
Click "+ Add" again and select "Add external location"

Set the URL of your storage container. This should be in the form: abfss://<container_name>@<storage_account_name>.dfs.core.windows.net/<path>

Once created, you can test the connection to make sure the container is accessible from Databricks.
Now you can create a catalog

Go to "Catalog" and click "Create Catalog". Name your catalog and select the storage location you created in the previous step.

Authentication

dlt currently supports two options for authentication:

OAuth2 (recommended) allows you to authenticate to Databricks using a service principal via OAuth2 M2M.
Access token approach using a developer access token. This method may be deprecated in the future by Databricks.

Using OAuth2

You can authenticate to Databricks using a service principal via OAuth2 M2M. To enable it:

Follow the instructions in the Databricks documentation: Authenticate access to Databricks using OAuth M2M to create a service principal and retrieve the client_id and client_secret.
Once you have the service principal credentials, update your credentials with any of the options shown below:

TOML config provider
Environment variables
In the code

# secrets.toml
[destination.databricks.credentials]
server_hostname = "MY_DATABRICKS.azuredatabricks.net"
http_path = "/sql/1.0/warehouses/12345"
catalog = "my_catalog"
client_id = "XXX"
client_secret = "XXX"

export DESTINATIONS__DATABRICKS__CREDENTIALS__SERVER_HOSTNAME="MY_DATABRICKS.azuredatabricks.net"
export DESTINATIONS__DATABRICKS__CREDENTIALS__HTTP_PATH="/sql/1.0/warehouses/12345"
export DESTINATIONS__DATABRICKS__CREDENTIALS__CATALOG="my_catalog"
export DESTINATIONS__DATABRICKS__CREDENTIALS__CLIENT_ID="XXX"
export DESTINATIONS__DATABRICKS__CREDENTIALS__CLIENT_SECRET="XXX"

import os

# Do not set up the secrets directly in the code!
# What you can do is reassign env variables.
os.environ["DESTINATIONS__DATABRICKS__CREDENTIALS__SERVER_HOSTNAME"] = "MY_DATABRICKS.azuredatabricks.net"
os.environ["DESTINATIONS__DATABRICKS__CREDENTIALS__HTTP_PATH"]="/sql/1.0/warehouses/12345"
os.environ["DESTINATIONS__DATABRICKS__CREDENTIALS__CATALOG"]="my_catalog"
os.environ["DESTINATIONS__DATABRICKS__CREDENTIALS__CLIENT_ID"]=os.environ.get("CLIENT_ID")
os.environ["DESTINATIONS__DATABRICKS__CREDENTIALS__CLIENT_SECRET"]=os.environ.get("CLIENT_SECRET")

Using access token

To create your access token:

Click your email in the top right corner and go to "User Settings". Go to "Developer" -> "Access Tokens". Generate a new token and save it.
Set up credentials in a desired way:

TOML config provider
Environment variables
In the code

# secrets.toml
[destination.databricks.credentials]
server_hostname = "MY_DATABRICKS.azuredatabricks.net"
http_path = "/sql/1.0/warehouses/12345"
catalog = "my_catalog"
access_token = "XXX"

export DESTINATIONS__DATABRICKS__CREDENTIALS__SERVER_HOSTNAME="MY_DATABRICKS.azuredatabricks.net"
export DESTINATIONS__DATABRICKS__CREDENTIALS__HTTP_PATH="/sql/1.0/warehouses/12345"
export DESTINATIONS__DATABRICKS__CREDENTIALS__CATALOG="my_catalog"
export DESTINATIONS__DATABRICKS__CREDENTIALS__ACCESS_TOKEN="XXX"

import os

# Do not set up the secrets directly in the code!
# What you can do is reassign env variables.
os.environ["DESTINATIONS__DATABRICKS__CREDENTIALS__SERVER_HOSTNAME"] = "MY_DATABRICKS.azuredatabricks.net"
os.environ["DESTINATIONS__DATABRICKS__CREDENTIALS__HTTP_PATH"]="/sql/1.0/warehouses/12345"
os.environ["DESTINATIONS__DATABRICKS__CREDENTIALS__CATALOG"]="my_catalog"
os.environ["DESTINATIONS__DATABRICKS__CREDENTIALS__ACCESS_TOKEN"]=os.environ.get("ACCESS_TOKEN")

Loader setup guide

1. Initialize a project with a pipeline that loads to Databricks by running

dlt init chess databricks

2. Install the necessary dependencies for Databricks by running

pip install -r requirements.txt

This will install dlt with the databricks extra, which contains the Databricks Python dbapi client.

3. Enter your credentials into .dlt/secrets.toml.

This should include your connection parameters and your authentication credentials.

You can find your server hostname and HTTP path in the Databricks workspace dashboard. Go to "SQL Warehouses", select your warehouse (default is called "Starter Warehouse"), and go to "Connection details".

Example:

[destination.databricks.credentials]
server_hostname = "MY_DATABRICKS.azuredatabricks.net"
http_path = "/sql/1.0/warehouses/12345"
client_id = "XXX"
client_secret = "XXX"
catalog = "my_catalog"

You can find other options for specifying credentials in the Authentication section.

See Staging support for authentication options when dlt copies files from buckets.

Using default credentials

If none of auth methods above is configured, dlt attempts to get authorization from the Databricks workspace context. The context may come, for example, from a Notebook (runtime) or via standard set of env variables that Databricks Python sdk recognizes (ie. DATABRICKS_TOKEN or DATABRICKS_HOST)

dlt is able to set server_hostname and http_path from available warehouses. We use default warehouse id (DATABRICKS_WAREHOUSE_ID) if set (via env variable), or a first one on warehouse's list.

Write disposition

All write dispositions are supported.

Data loading

To load data into Databricks, you must set up a staging filesystem by configuring an Amazon S3 or Azure Blob Storage bucket. Parquet is the default file format used for data uploads. As an alternative to Parquet, you can switch to using JSONL.

dlt will upload the data in Parquet files (or JSONL, if configured) to the bucket and then use COPY INTO statements to ingest the data into Databricks.

For more information on staging, see the Staging support section below.

Supported file formats

Parquet supported when staging is enabled.
JSONL supported when staging is enabled (see limitations below).

The JSONL format has some limitations when used with Databricks:

The following data types are not supported when using the JSONL format with databricks: decimal, json, date, binary. Use parquet if your data contains these types.
The bigint data type with precision is not supported with the JSONL format.

Direct Load (Databricks Managed Volumes)

dlt now supports Direct Load, enabling pipelines to run seamlessly from Databricks Notebooks without external staging. When executed in a Databricks Notebook, dlt uses the notebook context for configuration if not explicitly provided.

Direct Load also works outside Databricks, requiring explicit configuration of server_hostname, http_path, catalog, and authentication (client_id/client_secret for OAuth or access_token for token-based authentication).

The example below demonstrates how to load data directly from a Databricks Notebook. Simply specify the Databricks catalog and optionally a fully qualified volume name (recommended for production) – the remaining configuration comes from the notebook context:

import dlt
from dlt.destinations import databricks
from dlt.sources.rest_api import rest_api_source

# Fully qualified Databricks managed volume (recommended for production)
# - dlt assumes the named volume already exists
staging_volume_name = "dlt_ci.dlt_tests_shared.static_volume"

bricks = databricks(credentials={"catalog": "dlt_ci"}, staging_volume_name=staging_volume_name)

pokemon_source = rest_api_source(
    {
        "client": {"base_url": "https://pokeapi.co/api/v2/"},
        "resource_defaults": {"endpoint": {"params": {"limit": 1000}}},
        "resources": ["pokemon"],
    }
)

pipeline = dlt.pipeline(
    pipeline_name="rest_api_example",
    dataset_name="rest_api_data",
    destination=bricks,
)

load_info = pipeline.run(pokemon_source)
print(load_info)
print(pipeline.dataset().pokemon.df())

If no staging_volume_name is provided, dlt creates a default volume automatically.
For production, explicitly setting staging_volume_name is recommended.
The volume is used as a temporary location to store files before loading.

Module conflict

When using dlt within Databricks Notebooks, you may encounter naming conflicts with Databricks' built-in Delta Live Tables (DLT) module. To avoid these conflicts, follow the steps in the Troubleshooting section below.

tip

You can delete staged files immediately after loading by setting the following config option:

[destination.databricks]
keep_staged_files = false

Supported hints

Supported table hints

Databricks supports the following table hints:

description - Uses the description to add comment to the table. This can also be done by using the adapter parameter table_comment.

Databricks supports the following column hints:

primary_key - adds a primary key constraint to the column in Unity Catalog.
description - adds a description to the column. This can also be done by using the adapter parameter table_comment.
references - adds a foreign key constraint to the column in Unity Catalog.
not_null - adds a not null constraint to the column.
cluster - adds a clustering constraint to the column. This can also be done by using the adapter parameter cluster.

note

If you want to enforce constraints on the tables, you can set the create_indexes option to true. This will add PRIMARY KEY and FOREIGN KEY constraints to the tables if the hints primary key and references are set.

[destination.databricks]
# Add PRIMARY KEY and FOREIGN KEY constraints to tables
create_indexes=true

For additional hints specific to Databricks, see the Databricks adapter section.

Staging support

Databricks supports both Amazon S3, Azure Blob Storage, and Google Cloud Storage as staging locations. dlt will upload files in Parquet format to the staging location and will instruct Databricks to load data from there.

Databricks and Amazon S3

Please refer to the S3 documentation for details on connecting your S3 bucket with the bucket_url and credentials.

Example to set up Databricks with S3 as a staging destination:

TOML config provider
Environment variables
In the code

# secrets.toml
[destination.filesystem]
bucket_url = "s3://your-bucket-name"

[destination.filesystem.credentials]
aws_access_key_id="XXX"
aws_secret_access_key="XXX"

export DESTINATIONS__FILESYSTEM__BUCKET_URL="s3://your-bucket-name"
export DESTINATIONS__FILESYSTEM__CREDENTIALS__AWS_ACCESS_KEY_ID="XXX"
export DESTINATIONS__FILESYSTEM__CREDENTIALS__AWS_SECRET_ACCESS_KEY="XXX"

import os

# Do not set up the secrets directly in the code!
# What you can do is reassign env variables.
os.environ["DESTINATIONS__FILESYSTEM__BUCKET_URL"] = "s3://your-bucket-name"
os.environ["DESTINATIONS__FILESYSTEM__CREDENTIALS__AWS_ACCESS_KEY_ID"] = os.environ.get("AWS_ACCESS_KEY_ID")
os.environ["DESTINATIONS__FILESYSTEM__CREDENTIALS__AWS_SECRET_ACCESS_KEY"] = os.environ.get("AWS_SECRET_ACCESS_KEY")

Databricks and Azure Blob Storage

Refer to the Azure Blob Storage filesystem documentation for details on connecting your Azure Blob Storage container with the bucket_url and credentials.

To enable support for Azure Blob Storage with dlt, make sure to install the necessary dependencies by running:

pip install "dlt[az]"

note

Databricks requires that you use ABFS URLs in the following format: abfss://container_name@storage_account_name.dfs.core.windows.net/path. dlt is able to adapt the other representation (i.e., az://container-name/path), but we recommend that you use the correct form.

Example to set up Databricks with Azure as a staging destination:

TOML config provider
Environment variables
In the code

# secrets.toml
[destination.filesystem]
bucket_url = "abfss://container_name@storage_account_name.dfs.core.windows.net/path"

[destination.filesystem.credentials]
azure_storage_account_name="XXX"
azure_storage_account_key="XXX"

export DESTINATIONS__FILESYSTEM__BUCKET_URL="abfss://container_name@storage_account_name.dfs.core.windows.net/path"
export DESTINATIONS__FILESYSTEM__CREDENTIALS__AZURE_STORAGE_ACCOUNT_NAME="XXX"
export DESTINATIONS__FILESYSTEM__CREDENTIALS__AZURE_STORAGE_ACCOUNT_KEY="XXX"

import os

# Do not set up the secrets directly in the code!
# What you can do is reassign env variables.
os.environ["DESTINATIONS__FILESYSTEM__BUCKET_URL"] = "abfss://container_name@storage_account_name.dfs.core.windows.net/path"
os.environ["DESTINATIONS__FILESYSTEM__CREDENTIALS__AZURE_STORAGE_ACCOUNT_NAME"] = os.environ.get("AZURE_STORAGE_ACCOUNT_NAME")
os.environ["DESTINATIONS__FILESYSTEM__CREDENTIALS__AZURE_STORAGE_ACCOUNT_KEY"] = os.environ.get("AZURE_STORAGE_ACCOUNT_KEY")

Databricks and Google Cloud Storage

In order to load from Google Cloud Storage stage, you must set up the credentials via a named credential. See below. Databricks does not allow you to pass Google Credentials explicitly in SQL statements.

Use external locations and stored credentials

dlt forwards bucket credentials to the COPY INTO SQL command by default. You may prefer to use external locations or stored credentials instead that are stored on the Databricks side.

If you set up an external location for your staging path, you can tell dlt to use it:

[destination.databricks]
is_staging_external_location=true

If you set up Databricks credentials named, for example, credential_x, you can tell dlt to use them:

[destination.databricks]
staging_credentials_name="credential_x"

Both options are available from code:

import dlt

bricks = dlt.destinations.databricks(staging_credentials_name="credential_x")

Additional destination capabilities

dbt support

This destination integrates with dbt via dbt-databricks.

Syncing of `dlt` state

This destination fully supports dlt state sync.

Databricks user agent

We enable Databricks to identify that the connection is created by dlt. Databricks will use this user agent identifier to better understand the usage patterns associated with dlt integration. The connection identifier is dltHub_dlt.

Databricks adapter

You can use the databricks_adapter function to add Databricks-specific hints to a resource. These hints influence how data is loaded into Databricks tables, such as adding comments and tags. Hints can be defined at both the column level and table level.

The adapter updates the DltResource with metadata about the destination column and table DDL options.

Supported hints

table_comment: adds a comment to the table. Supports basic markdown format basic-syntax.
table_tags: adds tags to the table. Supports a list of strings and/or key-value pairs.
column_hints
- column_comment: adds a comment to the column. Supports basic markdown format basic-syntax.
- column_tags: adds tags to the column. Supports a list of strings and/or key-value pairs.

Use an adapter to apply hints to a resource

Here is an example of how to use the databricks_adapter function to apply hints to a resource on both the column level and table level:

import dlt
from dlt.destinations.adapters import databricks_adapter

@dlt.resource(
    columns=[
        {"name": "event_date", "data_type": "date"},
        {"name": "user_id", "data_type": "bigint"},
        # Other columns.
    ]
)
def event_data():
    yield from [
        {"event_date": datetime.date.today() + datetime.timedelta(days=i)} for i in range(100)
    ]

# Apply table and column options.
databricks_adapter(
    event_data,

    # Table level options.
    table_comment="Dummy event data.",
    table_tags=["pii", {"cost_center": "12345"}],

    # Column level options.
    column_hints={
        "event_date": {"column_comment": "The date of the event"},
        "user_id": {
            "column_comment": "The id of the user",
            "column_tags": ["pii", {"cost_center": "12345"}]
        },
    },
)

Troubleshooting

Use the following steps to avoid conflicts with Databricks' built-in Delta Live Tables (DLT) module and enable dltHub integration.

1. Add an `init` script

To ensure compatibility with the dltHub's dlt package in Databricks, add an init script that runs at cluster startup. This script installs the dlt package from dltHub, renames Databricks’ built-in DLT module to avoid naming conflicts, and updates internal references to allow continued use under the alias dlt_dbricks.

In your Databricks workspace directory, create a new file named init.sh and add the following content:

#! /bin/bash

# move Databricks' dlt package to a different folder name
mv /databricks/spark/python/dlt/ /databricks/spark/python/dlt_dbricks

# Replace all mentions of `dlt` with `dlt_dbricks` so that Databricks' dlt
# can be used as `dlt_dbricks` in the notebook instead
find /databricks/spark/python/dlt_dbricks/ -type f -exec sed -i 's/from dlt/from dlt_dbricks/g' {} \;

# Replace mentions of `dlt` with `dlt_dbricks` in DeltaLiveTablesHook.py to
# avoid import errors
sed -i "s/'dlt'/'dlt_dbricks'/g" /databricks/python_shell/dbruntime/DeltaLiveTablesHook.py
sed -i "s/from dlt/from dlt_dbricks/g" /databricks/python_shell/dbruntime/DeltaLiveTablesHook.py

# Install dltHub dlt
pip install dlt

Go to your cluster, click Edit, scroll down to Advanced Options, and open the Init Scripts tab.
Under Source, choose Workspace, then browse to your init.sh file and click Add.
Click Confirm to apply the change. The cluster will restart automatically.

2. Remove preloaded databricks modules in the notebook

After the cluster starts, Databricks may partially import its built-in Delta Live Tables (DLT) modules, which can interfere with the dlt package from dltHub.

To ensure a clean environment, add the following code at the top of your notebook:

import sys
import types

# 1 Drop Databricks' post-import hook
sys.meta_path = [h for h in sys.meta_path if 'PostImportHook' not in repr(h)]

# 2️ Purge half-initialized Delta-Live-Tables modules
for name, module in list(sys.modules.items()):
    if not isinstance(module, types.ModuleType):
        continue
    if getattr(module, '__file__', '').startswith('/databricks/spark/python/dlt'):
        del sys.modules[name]

This ensures the dlt package from dltHub is used instead of the built-in Databricks DLT module.

caution

It is best practice to use an init.sh script.

Modifying sys.meta_path or sys.modules is fragile and may break after Databricks updates, potentially causing unexpected issues. If this workaround is necessary, validate your setup after each platform upgrade.

Databricks

Install dlt with Databricks

Set up your Databricks workspace

1. Create a Databricks workspace in Azure

2. Set up a metastore and Unity Catalog

Authentication

Using OAuth2

Using access token

Loader setup guide

Using default credentials

Write disposition

Data loading

Supported file formats

Direct Load (Databricks Managed Volumes)

Supported hints

Supported table hints

Staging support

Databricks and Amazon S3

Databricks and Azure Blob Storage

Databricks and Google Cloud Storage

Use external locations and stored credentials

Additional destination capabilities

dbt support

Syncing of `dlt` state

Databricks user agent

Databricks adapter

Supported hints

Use an adapter to apply hints to a resource

Troubleshooting

1. Add an `init` script

2. Remove preloaded databricks modules in the notebook

Additional Setup guides

DHelp

Ask a question

Install dlt with Databricks​

Set up your Databricks workspace​

1. Create a Databricks workspace in Azure​

2. Set up a metastore and Unity Catalog​

Authentication​

Using OAuth2​

Using access token​

Loader setup guide​

Using default credentials​

Write disposition​

Data loading​

Supported file formats​

Direct Load (Databricks Managed Volumes)​

Supported hints​

Supported table hints​

Staging support​

Databricks and Amazon S3​

Databricks and Azure Blob Storage​

Databricks and Google Cloud Storage​

Use external locations and stored credentials​

Additional destination capabilities​

dbt support​

Syncing of dlt state​

Databricks user agent​

Databricks adapter​

Supported hints​

Use an adapter to apply hints to a resource​

Troubleshooting​

1. Add an init script​

2. Remove preloaded databricks modules in the notebook​

Additional Setup guides​

DHelp

Ask a question

Install dlt with Databricks

Set up your Databricks workspace

1. Create a Databricks workspace in Azure

2. Set up a metastore and Unity Catalog

Authentication

Using OAuth2

Using access token

Loader setup guide

Using default credentials

Write disposition

Data loading

Supported file formats

Direct Load (Databricks Managed Volumes)

Supported hints

Supported table hints

Staging support

Databricks and Amazon S3

Databricks and Azure Blob Storage

Databricks and Google Cloud Storage

Use external locations and stored credentials

Additional destination capabilities

dbt support

Syncing of `dlt` state

Databricks user agent

Databricks adapter

Supported hints

Use an adapter to apply hints to a resource

Troubleshooting

1. Add an `init` script

2. Remove preloaded databricks modules in the notebook

Additional Setup guides