Databricks
Big thanks to Evan Phillips and swishbi.com for contributing code, time, and a test environment.
The Databricks destination loads your data into Databricks Delta tables using one of the supported cloud storage options. You can access your data using Unity Catalog.
There are two options to run dlt pipelines and load data:
- Run dlt pipelines in any environment by providing credentials for both Databricks and your cloud storage.
- Run dlt pipelines directly within Databricks notebooks without explicitly providing credentials.
If you'd like to load data into Databricks Managed Iceberg tables, use the dlt+ Iceberg destination.
Install dlt with Databricks
To install the dlt library with Databricks dependencies:
pip install "dlt[databricks]"
Set up your Databricks workspace
To use the Databricks destination, you need:
- A Databricks workspace with a Unity Catalog metastore connected
- A Gen 2 Azure storage account and container
If you already have your Databricks workspace set up, you can skip to the Loader setup guide.
1. Create a Databricks workspace in Azure
- Create a Databricks workspace in Azure
  In your Azure Portal, search for Databricks and create a new workspace. In the "Pricing Tier" section, select "Premium" to be able to use the Unity Catalog.
- Create an ADLS Gen 2 storage account
  Search for "Storage accounts" in the Azure Portal and create a new storage account. Make sure it's a Data Lake Storage Gen 2 account by enabling "hierarchical namespace" when creating the account. Refer to the Azure documentation for further information.
- Create a container in the storage account
  In the storage account, create a new container. This will be used as a datastore for your Databricks catalog.
- Create an Access Connector for Azure Databricks
  This will allow Databricks to access your storage account. In the Azure Portal, search for "Access Connector for Azure Databricks" and create a new connector.
- Grant access to your storage container
  Navigate to the storage container you created earlier and select "Access control (IAM)" in the left-hand menu. Add a new role assignment and select "Storage Blob Data Contributor" as the role. Under "Members", select "Managed Identity" and add the Databricks Access Connector you created in the previous step.
2. Set up a metastore and Unity Catalog
- Now go to your Databricks workspace
  To get there from the Azure Portal, search for "Databricks", select your Databricks workspace, and click "Launch Workspace".
- In the top right corner, click on your email address and go to "Manage Account"
- Go to "Data" and click on "Create Metastore"
  Name your metastore and select a region. If you'd like to set up a storage container for the whole metastore, you can add your ADLS URL and Access Connector ID here. You can also do this at a more granular level when creating the catalog.
  In the next step, assign your metastore to your workspace.
- Go back to your workspace and click on "Catalog" in the left-hand menu
- Click "+ Add" and select "Add Storage Credential"
  Create a name and paste in the resource ID of the Databricks Access Connector from the Azure portal. It will look something like this:
  /subscriptions/<subscription_id>/resourceGroups/<resource_group>/providers/Microsoft.Databricks/accessConnectors/<connector_name>
- Click "+ Add" again and select "Add external location"
  Set the URL of your storage container. This should be in the form:
  abfss://<container_name>@<storage_account_name>.dfs.core.windows.net/<path>
  Once created, you can test the connection to make sure the container is accessible from Databricks.
- Now you can create a catalog
  Go to "Catalog" and click "Create Catalog". Name your catalog and select the storage location you created in the previous step.
Authentication
dlt currently supports two options for authentication:
- OAuth2 (recommended) allows you to authenticate to Databricks using a service principal via OAuth2 M2M.
- Access token approach using a developer access token. This method may be deprecated in the future by Databricks.
Using OAuth2
You can authenticate to Databricks using a service principal via OAuth2 M2M. To enable it:
- Follow the instructions in the Databricks documentation: Authenticate access to Databricks using OAuth M2M to create a service principal and retrieve the client_id and client_secret.
- Once you have the service principal credentials, update your credentials with any of the options shown below:
- TOML config provider
- Environment variables
- In the code
# secrets.toml
[destination.databricks.credentials]
server_hostname = "MY_DATABRICKS.azuredatabricks.net"
http_path = "/sql/1.0/warehouses/12345"
catalog = "my_catalog"
client_id = "XXX"
client_secret = "XXX"
export DESTINATIONS__DATABRICKS__CREDENTIALS__SERVER_HOSTNAME="MY_DATABRICKS.azuredatabricks.net"
export DESTINATIONS__DATABRICKS__CREDENTIALS__HTTP_PATH="/sql/1.0/warehouses/12345"
export DESTINATIONS__DATABRICKS__CREDENTIALS__CATALOG="my_catalog"
export DESTINATIONS__DATABRICKS__CREDENTIALS__CLIENT_ID="XXX"
export DESTINATIONS__DATABRICKS__CREDENTIALS__CLIENT_SECRET="XXX"
import os
# Do not set up the secrets directly in the code!
# What you can do is reassign env variables.
os.environ["DESTINATIONS__DATABRICKS__CREDENTIALS__SERVER_HOSTNAME"] = "MY_DATABRICKS.azuredatabricks.net"
os.environ["DESTINATIONS__DATABRICKS__CREDENTIALS__HTTP_PATH"]="/sql/1.0/warehouses/12345"
os.environ["DESTINATIONS__DATABRICKS__CREDENTIALS__CATALOG"]="my_catalog"
os.environ["DESTINATIONS__DATABRICKS__CREDENTIALS__CLIENT_ID"]=os.environ.get("CLIENT_ID")
os.environ["DESTINATIONS__DATABRICKS__CREDENTIALS__CLIENT_SECRET"]=os.environ.get("CLIENT_SECRET")
Using access token
To create your access token:
- Click your email in the top right corner and go to "User Settings". Go to "Developer" -> "Access Tokens". Generate a new token and save it.
- Set up credentials in one of the ways shown below:
- TOML config provider
- Environment variables
- In the code
# secrets.toml
[destination.databricks.credentials]
server_hostname = "MY_DATABRICKS.azuredatabricks.net"
http_path = "/sql/1.0/warehouses/12345"
catalog = "my_catalog"
access_token = "XXX"
export DESTINATIONS__DATABRICKS__CREDENTIALS__SERVER_HOSTNAME="MY_DATABRICKS.azuredatabricks.net"
export DESTINATIONS__DATABRICKS__CREDENTIALS__HTTP_PATH="/sql/1.0/warehouses/12345"
export DESTINATIONS__DATABRICKS__CREDENTIALS__CATALOG="my_catalog"
export DESTINATIONS__DATABRICKS__CREDENTIALS__ACCESS_TOKEN="XXX"
import os
# Do not set up the secrets directly in the code!
# What you can do is reassign env variables.
os.environ["DESTINATIONS__DATABRICKS__CREDENTIALS__SERVER_HOSTNAME"] = "MY_DATABRICKS.azuredatabricks.net"
os.environ["DESTINATIONS__DATABRICKS__CREDENTIALS__HTTP_PATH"]="/sql/1.0/warehouses/12345"
os.environ["DESTINATIONS__DATABRICKS__CREDENTIALS__CATALOG"]="my_catalog"
os.environ["DESTINATIONS__DATABRICKS__CREDENTIALS__ACCESS_TOKEN"]=os.environ.get("ACCESS_TOKEN")
Loader setup guide
1. Initialize a project with a pipeline that loads to Databricks by running
   dlt init chess databricks
2. Install the necessary dependencies for Databricks by running
   pip install -r requirements.txt
   This will install dlt with the databricks extra, which contains the Databricks Python dbapi client.
3. Enter your credentials into .dlt/secrets.toml.
   This should include your connection parameters and your authentication credentials.
   You can find your server hostname and HTTP path in the Databricks workspace dashboard. Go to "SQL Warehouses", select your warehouse (the default is called "Starter Warehouse"), and go to "Connection details".
Example:
[destination.databricks.credentials]
server_hostname = "MY_DATABRICKS.azuredatabricks.net"
http_path = "/sql/1.0/warehouses/12345"
client_id = "XXX"
client_secret = "XXX"
catalog = "my_catalog"
You can find other options for specifying credentials in the Authentication section.
See Staging support for authentication options when dlt copies files from buckets.
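Once the credentials are in place, a minimal pipeline can reference the destination by name. This is a sketch only; the pipeline and dataset names are placeholders, and data reaches Databricks via a staging bucket or managed volume as described in the sections below:
import dlt

# Minimal sketch: "databricks" resolves connection details and credentials
# from .dlt/secrets.toml. Pipeline and dataset names are placeholders.
pipeline = dlt.pipeline(
    pipeline_name="chess_pipeline",
    destination="databricks",
    dataset_name="chess_data",
)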
Using default credentials
If none of the auth methods above is configured, dlt attempts to get authorization from the Databricks workspace context. The context may come, for example, from a notebook (runtime) or from the standard set of environment variables that the Databricks Python SDK recognizes (e.g., DATABRICKS_TOKEN or DATABRICKS_HOST).
dlt is able to set server_hostname and http_path from the available warehouses. It uses the default warehouse id (DATABRICKS_WAREHOUSE_ID) if set via an env variable, or the first one on the warehouse list.
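For example, when running in a Databricks notebook (or with DATABRICKS_HOST and DATABRICKS_TOKEN exported), it is enough to set the catalog explicitly. A minimal sketch; the catalog, pipeline, dataset, and table names are placeholders:
import dlt
from dlt.destinations import databricks

# Sketch: no server_hostname, http_path, or token given here. dlt resolves them
# from the notebook context or the DATABRICKS_* env variables, and picks the
# warehouse from DATABRICKS_WAREHOUSE_ID or the first available one.
bricks = databricks(credentials={"catalog": "my_catalog"})

pipeline = dlt.pipeline(
    pipeline_name="default_credentials_example",
    destination=bricks,
    dataset_name="my_dataset",
)
load_info = pipeline.run([{"id": 1}, {"id": 2}], table_name="items")
print(load_info)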
Write disposition
All write dispositions are supported.
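For example, a resource can declare a merge disposition with a primary key; append and replace behave as with other SQL destinations. A short sketch with placeholder names:
import dlt

# Sketch: "merge" upserts rows by primary key; "append" and "replace" are
# supported as well. Resource and column names are placeholders.
@dlt.resource(primary_key="id", write_disposition="merge")
def users():
    yield [{"id": 1, "name": "alice"}, {"id": 2, "name": "bob"}]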
Data loading
To load data into Databricks, you must set up a staging filesystem by configuring an Amazon S3 or Azure Blob Storage bucket. Parquet is the default file format used for data uploads. As an alternative to Parquet, you can switch to using JSONL.
dlt will upload the data in Parquet files (or JSONL, if configured) to the bucket and then use COPY INTO statements to ingest the data into Databricks.
For more information on staging, see the Staging support section below.
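A sketch of a pipeline that uses the filesystem staging destination; the bucket itself is configured as shown in the Staging support section, and all names below are placeholders:
import dlt

# Sketch: dlt writes files to the configured filesystem bucket first and then
# issues COPY INTO against Databricks. Names are placeholders.
pipeline = dlt.pipeline(
    pipeline_name="databricks_staged",
    destination="databricks",
    staging="filesystem",
    dataset_name="staged_data",
)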
Supported file formats
- Parquet is supported when staging is enabled.
- JSONL is supported when staging is enabled (see limitations below).
The JSONL format has some limitations when used with Databricks:
- Compression must be disabled to load JSONL files in Databricks. Set data_writer.disable_compression to true in the dlt config when using this format, as shown in the sketch after this list.
- The following data types are not supported when using the JSONL format with databricks: decimal, json, date, binary. Use parquet if your data contains these types.
- The bigint data type with precision is not supported with the JSONL format.
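A sketch of switching a run to JSONL with compression disabled; the environment-variable spelling of data_writer.disable_compression is an assumption based on dlt's usual section__key naming:
import os

import dlt

# Assumption: env-variable form of the data_writer.disable_compression option.
os.environ["DATA_WRITER__DISABLE_COMPRESSION"] = "true"

pipeline = dlt.pipeline(
    pipeline_name="jsonl_example",
    destination="databricks",
    staging="filesystem",
    dataset_name="jsonl_data",
)
# Select jsonl as the staging file format for this run.
load_info = pipeline.run([{"id": 1}], table_name="items", loader_file_format="jsonl")
print(load_info)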
Direct Load (Databricks Managed Volumes)
dlt now supports Direct Load, enabling pipelines to run seamlessly from Databricks notebooks without external staging. When executed in a Databricks notebook, dlt uses the notebook context for configuration if it is not explicitly provided.
Direct Load also works outside Databricks, but then requires explicit configuration of server_hostname, http_path, catalog, and authentication (client_id/client_secret for OAuth or access_token for token-based authentication).
The example below demonstrates how to load data directly from a Databricks notebook. Simply specify the Databricks catalog and, optionally, a fully qualified volume name (recommended for production); the remaining configuration comes from the notebook context:
import dlt
from dlt.destinations import databricks
from dlt.sources.rest_api import rest_api_source
# Fully qualified Databricks managed volume (recommended for production)
# - dlt assumes the named volume already exists
staging_volume_name = "dlt_ci.dlt_tests_shared.static_volume"
bricks = databricks(credentials={"catalog": "dlt_ci"}, staging_volume_name=staging_volume_name)
pokemon_source = rest_api_source(
    {
        "client": {"base_url": "https://pokeapi.co/api/v2/"},
        "resource_defaults": {"endpoint": {"params": {"limit": 1000}}},
        "resources": ["pokemon"],
    }
)
pipeline = dlt.pipeline(
    pipeline_name="rest_api_example",
    dataset_name="rest_api_data",
    destination=bricks,
)
load_info = pipeline.run(pokemon_source)
print(load_info)
print(pipeline.dataset().pokemon.df())
- If no staging_volume_name is provided, dlt creates a default volume automatically.
- For production, explicitly setting staging_volume_name is recommended.
- The volume is used as a temporary location to store files before loading.
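When running Direct Load outside Databricks, the same destination takes explicit connection details. A sketch in which every value is a placeholder:
from dlt.destinations import databricks

# Sketch: Direct Load from outside a Databricks notebook; all values are placeholders.
bricks = databricks(
    credentials={
        "server_hostname": "MY_DATABRICKS.azuredatabricks.net",
        "http_path": "/sql/1.0/warehouses/12345",
        "catalog": "my_catalog",
        "client_id": "XXX",      # or use access_token for token-based authentication
        "client_secret": "XXX",
    },
    staging_volume_name="my_catalog.my_schema.my_volume",  # fully qualified managed volume
)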
When using dlt within Databricks Notebooks, you may encounter naming conflicts with Databricks' built-in Delta Live Tables (DLT) module. To avoid these conflicts, follow the steps in the Troubleshooting section below.
You can delete staged files immediately after loading by setting the following config option:
[destination.databricks]
keep_staged_files = false
Supported hints
Supported table hints
Databricks supports the following table hints:
- description - uses the description to add a comment to the table. This can also be done by using the adapter parameter table_comment.
Supported column hints
Databricks supports the following column hints:
- primary_key - adds a primary key constraint to the column in Unity Catalog.
- description - adds a description to the column. This can also be done by using the adapter parameter column_comment.
- references - adds a foreign key constraint to the column in Unity Catalog.
- not_null - adds a not null constraint to the column.
- cluster - adds a clustering constraint to the column. This can also be done by using the adapter parameter cluster.
If you want to enforce constraints on the tables, you can set the create_indexes option to true. This will add PRIMARY KEY and FOREIGN KEY constraints to the tables if the primary_key and references hints are set.
[destination.databricks]
# Add PRIMARY KEY and FOREIGN KEY constraints to tables
create_indexes=true
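For example, a resource with a primary_key hint and a non-nullable column; with create_indexes enabled as above, the primary key becomes a PRIMARY KEY constraint in Unity Catalog. A sketch with placeholder names:
import dlt

# Sketch: primary_key maps to a PRIMARY KEY constraint when create_indexes=true;
# nullable=False maps to a NOT NULL constraint. Names are placeholders.
@dlt.resource(
    primary_key="id",
    columns={"email": {"nullable": False}},
)
def customers():
    yield [{"id": 1, "email": "a@example.com"}]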
For additional hints specific to Databricks, see the Databricks adapter section.
Staging support
Databricks supports Amazon S3, Azure Blob Storage, and Google Cloud Storage as staging locations. dlt will upload files in Parquet format to the staging location and will instruct Databricks to load data from there.
Databricks and Amazon S3
Please refer to the S3 documentation for details on connecting your S3 bucket with the bucket_url and credentials.
Example to set up Databricks with S3 as a staging destination:
- TOML config provider
- Environment variables
- In the code
# secrets.toml
[destination.filesystem]
bucket_url = "s3://your-bucket-name"
[destination.filesystem.credentials]
aws_access_key_id="XXX"
aws_secret_access_key="XXX"
export DESTINATIONS__FILESYSTEM__BUCKET_URL="s3://your-bucket-name"
export DESTINATIONS__FILESYSTEM__CREDENTIALS__AWS_ACCESS_KEY_ID="XXX"
export DESTINATIONS__FILESYSTEM__CREDENTIALS__AWS_SECRET_ACCESS_KEY="XXX"
import os
# Do not set up the secrets directly in the code!
# What you can do is reassign env variables.
os.environ["DESTINATIONS__FILESYSTEM__BUCKET_URL"] = "s3://your-bucket-name"
os.environ["DESTINATIONS__FILESYSTEM__CREDENTIALS__AWS_ACCESS_KEY_ID"] = os.environ.get("AWS_ACCESS_KEY_ID")
os.environ["DESTINATIONS__FILESYSTEM__CREDENTIALS__AWS_SECRET_ACCESS_KEY"] = os.environ.get("AWS_SECRET_ACCESS_KEY")
Databricks and Azure Blob Storage
Refer to the Azure Blob Storage filesystem documentation for details on connecting your Azure Blob Storage container with the bucket_url and credentials.
To enable support for Azure Blob Storage with dlt, make sure to install the necessary dependencies by running:
pip install "dlt[az]"
Databricks requires that you use ABFS URLs in the following format: abfss://container_name@storage_account_name.dfs.core.windows.net/path.
dlt is able to adapt the other representation (i.e., az://container-name/path), but we recommend that you use the correct form.
Example to set up Databricks with Azure as a staging destination:
- TOML config provider
- Environment variables
- In the code
# secrets.toml
[destination.filesystem]
bucket_url = "abfss://container_name@storage_account_name.dfs.core.windows.net/path"
[destination.filesystem.credentials]
azure_storage_account_name="XXX"
azure_storage_account_key="XXX"
export DESTINATIONS__FILESYSTEM__BUCKET_URL="abfss://container_name@storage_account_name.dfs.core.windows.net/path"
export DESTINATIONS__FILESYSTEM__CREDENTIALS__AZURE_STORAGE_ACCOUNT_NAME="XXX"
export DESTINATIONS__FILESYSTEM__CREDENTIALS__AZURE_STORAGE_ACCOUNT_KEY="XXX"
import os
# Do not set up the secrets directly in the code!
# What you can do is reassign env variables.
os.environ["DESTINATIONS__FILESYSTEM__BUCKET_URL"] = "abfss://container_name@storage_account_name.dfs.core.windows.net/path"
os.environ["DESTINATIONS__FILESYSTEM__CREDENTIALS__AZURE_STORAGE_ACCOUNT_NAME"] = os.environ.get("AZURE_STORAGE_ACCOUNT_NAME")
os.environ["DESTINATIONS__FILESYSTEM__CREDENTIALS__AZURE_STORAGE_ACCOUNT_KEY"] = os.environ.get("AZURE_STORAGE_ACCOUNT_KEY")
Databricks and Google Cloud Storage
In order to load from a Google Cloud Storage stage, you must set up the credentials via a named credential (see below). Databricks does not allow you to pass Google credentials explicitly in SQL statements.
Use external locations and stored credentials
dlt forwards bucket credentials to the COPY INTO SQL command by default. You may prefer to use external locations or stored credentials that are kept on the Databricks side instead.
If you set up an external location for your staging path, you can tell dlt to use it:
[destination.databricks]
is_staging_external_location=true
If you set up Databricks credentials named, for example, credential_x, you can tell dlt to use them:
[destination.databricks]
staging_credentials_name="credential_x"
Both options are available from code:
import dlt
bricks = dlt.destinations.databricks(staging_credentials_name="credential_x")
Additional destination capabilities
dbt support
This destination integrates with dbt via dbt-databricks.
Syncing of dlt state
This destination fully supports dlt state sync.
Databricks user agent
We enable Databricks to identify that the connection is created by dlt.
Databricks will use this user agent identifier to better understand the usage patterns associated with dlt integration. The connection identifier is dltHub_dlt.
Databricks adapter
You can use the databricks_adapter function to add Databricks-specific hints to a resource. These hints influence how data is loaded into Databricks tables, such as adding comments and tags. Hints can be defined at both the column level and the table level.
The adapter updates the DltResource with metadata about the destination column and table DDL options.
Supported hints
- table_comment: adds a comment to the table. Supports basic markdown syntax.
- table_tags: adds tags to the table. Supports a list of strings and/or key-value pairs.
- column_hints:
  - column_comment: adds a comment to the column. Supports basic markdown syntax.
  - column_tags: adds tags to the column. Supports a list of strings and/or key-value pairs.
Use an adapter to apply hints to a resource
Here is an example of how to use the databricks_adapter function to apply hints to a resource at both the column level and the table level:
import datetime

import dlt
from dlt.destinations.adapters import databricks_adapter

@dlt.resource(
    columns=[
        {"name": "event_date", "data_type": "date"},
        {"name": "user_id", "data_type": "bigint"},
        # Other columns.
    ]
)
def event_data():
    yield from [
        {"event_date": datetime.date.today() + datetime.timedelta(days=i)} for i in range(100)
    ]
# Apply table and column options.
databricks_adapter(
    event_data,
    # Table level options.
    table_comment="Dummy event data.",
    table_tags=["pii", {"cost_center": "12345"}],
    # Column level options.
    column_hints={
        "event_date": {"column_comment": "The date of the event"},
        "user_id": {
            "column_comment": "The id of the user",
            "column_tags": ["pii", {"cost_center": "12345"}],
        },
    },
)
Troubleshooting
Use the following steps to avoid conflicts with Databricks' built-in Delta Live Tables (DLT) module and enable dltHub integration.
1. Add an init script
To ensure compatibility with dltHub's dlt package in Databricks, add an init script that runs at cluster startup. This script installs the dlt package from dltHub, renames Databricks' built-in DLT module to avoid naming conflicts, and updates internal references to allow continued use under the alias dlt_dbricks.
- In your Databricks workspace directory, create a new file named init.sh and add the following content:
#! /bin/bash
# move Databricks' dlt package to a different folder name
mv /databricks/spark/python/dlt/ /databricks/spark/python/dlt_dbricks
# Replace all mentions of `dlt` with `dlt_dbricks` so that Databricks' dlt
# can be used as `dlt_dbricks` in the notebook instead
find /databricks/spark/python/dlt_dbricks/ -type f -exec sed -i 's/from dlt/from dlt_dbricks/g' {} \;
# Replace mentions of `dlt` with `dlt_dbricks` in DeltaLiveTablesHook.py to
# avoid import errors
sed -i "s/'dlt'/'dlt_dbricks'/g" /databricks/python_shell/dbruntime/DeltaLiveTablesHook.py
sed -i "s/from dlt/from dlt_dbricks/g" /databricks/python_shell/dbruntime/DeltaLiveTablesHook.py
# Install dltHub dlt
pip install dlt
- Go to your cluster, click Edit, scroll down to Advanced Options, and open the Init Scripts tab.
- Under Source, choose Workspace, then browse to your init.sh file and click Add.
- Click Confirm to apply the change. The cluster will restart automatically.
2. Remove preloaded Databricks modules in the notebook
After the cluster starts, Databricks may partially import its built-in Delta Live Tables (DLT) modules, which can interfere with the dlt package from dltHub.
To ensure a clean environment, add the following code at the top of your notebook:
import sys
import types
# 1. Drop Databricks' post-import hook
sys.meta_path = [h for h in sys.meta_path if 'PostImportHook' not in repr(h)]
# 2. Purge half-initialized Delta Live Tables modules
for name, module in list(sys.modules.items()):
if not isinstance(module, types.ModuleType):
continue
if getattr(module, '__file__', '').startswith('/databricks/spark/python/dlt'):
del sys.modules[name]
This ensures the dlt package from dltHub is used instead of the built-in Databricks DLT module.
It is best practice to use an init.sh script.
Modifying sys.meta_path or sys.modules is fragile and may break after Databricks updates, potentially causing unexpected issues.
If this workaround is necessary, validate your setup after each platform upgrade.
Additional setup guides
- Load data from Azure Cloud Storage to Databricks in python with dlt
- Load data from Oracle Database to Databricks in python with dlt
- Load data from Spotify to Databricks in python with dlt
- Load data from MySQL to Databricks in python with dlt
- Load data from Looker to Databricks in python with dlt
- Load data from Zuora to Databricks in python with dlt
- Load data from X to Databricks in python with dlt
- Load data from CircleCI to Databricks in python with dlt
- Load data from Bitbucket to Databricks in python with dlt
- Load data from Keap to Databricks in python with dlt