Databricks
Big thanks to Evan Phillips and swishbi.com for contributing code, time, and a test environment.
Install dlt with Databricksโ
To install the dlt library with Databricks dependencies:
pip install "dlt[databricks]"
Set up your Databricks workspaceโ
To use the Databricks destination, you need:
- A Databricks workspace with a Unity Catalog metastore connected
- A Gen 2 Azure storage account and container
If you already have your Databricks workspace set up, you can skip to the Loader setup guide.
1. Create a Databricks workspace in Azureโ
Create a Databricks workspace in Azure
In your Azure Portal, search for Databricks and create a new workspace. In the "Pricing Tier" section, select "Premium" to be able to use the Unity Catalog.
Create an ADLS Gen 2 storage account
Search for "Storage accounts" in the Azure Portal and create a new storage account. Make sure it's a Data Lake Storage Gen 2 account by enabling "hierarchical namespace" when creating the account. Refer to the Azure documentation for further information.
Create a container in the storage account
In the storage account, create a new container. This will be used as a datastore for your Databricks catalog.
Create an Access Connector for Azure Databricks
This will allow Databricks to access your storage account. In the Azure Portal, search for "Access Connector for Azure Databricks" and create a new connector.
Grant access to your storage container
Navigate to the storage container you created earlier and select "Access control (IAM)" in the left-hand menu.
Add a new role assignment and select "Storage Blob Data Contributor" as the role. Under "Members" select "Managed Identity" and add the Databricks Access Connector you created in the previous step.
2. Set up a metastore and Unity Catalogโ
Now go to your Databricks workspace
To get there from the Azure Portal, search for "Databricks", select your Databricks, and click "Launch Workspace".
In the top right corner, click on your email address and go to "Manage Account"
Go to "Data" and click on "Create Metastore"
Name your metastore and select a region. If you'd like to set up a storage container for the whole metastore, you can add your ADLS URL and Access Connector Id here. You can also do this on a granular level when creating the catalog.
In the next step, assign your metastore to your workspace.
Go back to your workspace and click on "Catalog" in the left-hand menu
Click "+ Add" and select "Add Storage Credential"
Create a name and paste in the resource ID of the Databricks Access Connector from the Azure portal. It will look something like this:
/subscriptions/<subscription_id>/resourceGroups/<resource_group>/providers/Microsoft.Databricks/accessConnectors/<connector_name>
Click "+ Add" again and select "Add external location"
Set the URL of your storage container. This should be in the form:
abfss://<container_name>@<storage_account_name>.dfs.core.windows.net/<path>
Once created, you can test the connection to make sure the container is accessible from Databricks.
Now you can create a catalog
Go to "Catalog" and click "Create Catalog". Name your catalog and select the storage location you created in the previous step.
Authenticationโ
dlt
currently supports two options for authentication:
- OAuth2 (recommended) allows you to authenticate to Databricks using a service principal via OAuth2 M2M.
- Access token approach using a developer access token. This method may be deprecated in the future by Databricks.
Using OAuth2โ
You can authenticate to Databricks using a service principal via OAuth2 M2M. To enable it:
Follow the instructions in the Databricks documentation: Authenticate access to Databricks using OAuth M2M to create a service principal and retrieve the
client_id
andclient_secret
.Once you have the service principal credentials, update your credentials with any of the options shown below:
- TOML config provider
- Environment variables
- In the code
# secrets.toml
[destination.databricks.credentials]
server_hostname = "MY_DATABRICKS.azuredatabricks.net"
http_path = "/sql/1.0/warehouses/12345"
catalog = "my_catalog"
client_id = "XXX"
client_secret = "XXX"
export DESTINATIONS__DATABRICKS__CREDENTIALS__SERVER_HOSTNAME="MY_DATABRICKS.azuredatabricks.net"
export DESTINATIONS__DATABRICKS__CREDENTIALS__HTTP_PATH="/sql/1.0/warehouses/12345"
export DESTINATIONS__DATABRICKS__CREDENTIALS__CATALOG="my_catalog"
export DESTINATIONS__DATABRICKS__CREDENTIALS__CLIENT_ID="XXX"
export DESTINATIONS__DATABRICKS__CREDENTIALS__CLIENT_SECRET="XXX"
import os
# Do not set up the secrets directly in the code!
# What you can do is reassign env variables.
os.environ["DESTINATIONS__DATABRICKS__CREDENTIALS__SERVER_HOSTNAME"] = "MY_DATABRICKS.azuredatabricks.net"
os.environ["DESTINATIONS__DATABRICKS__CREDENTIALS__HTTP_PATH"]="/sql/1.0/warehouses/12345"
os.environ["DESTINATIONS__DATABRICKS__CREDENTIALS__CATALOG"]="my_catalog"
os.environ["DESTINATIONS__DATABRICKS__CREDENTIALS__CLIENT_ID"]=os.environ.get("CLIENT_ID")
os.environ["DESTINATIONS__DATABRICKS__CREDENTIALS__CLIENT_SECRET"]=os.environ.get("CLIENT_SECRET")
Using access tokenโ
To create your access token:
- Click your email in the top right corner and go to "User Settings". Go to "Developer" -> "Access Tokens". Generate a new token and save it.
- Set up credentials in a desired way:
- TOML config provider
- Environment variables
- In the code
# secrets.toml
[destination.databricks.credentials]
server_hostname = "MY_DATABRICKS.azuredatabricks.net"
http_path = "/sql/1.0/warehouses/12345"
catalog = "my_catalog"
access_token = "XXX"
export DESTINATIONS__DATABRICKS__CREDENTIALS__SERVER_HOSTNAME="MY_DATABRICKS.azuredatabricks.net"
export DESTINATIONS__DATABRICKS__CREDENTIALS__HTTP_PATH="/sql/1.0/warehouses/12345"
export DESTINATIONS__DATABRICKS__CREDENTIALS__CATALOG="my_catalog"
export DESTINATIONS__DATABRICKS__CREDENTIALS__ACCESS_TOKEN="XXX"
import os
# Do not set up the secrets directly in the code!
# What you can do is reassign env variables.
os.environ["DESTINATIONS__DATABRICKS__CREDENTIALS__SERVER_HOSTNAME"] = "MY_DATABRICKS.azuredatabricks.net"
os.environ["DESTINATIONS__DATABRICKS__CREDENTIALS__HTTP_PATH"]="/sql/1.0/warehouses/12345"
os.environ["DESTINATIONS__DATABRICKS__CREDENTIALS__CATALOG"]="my_catalog"
os.environ["DESTINATIONS__DATABRICKS__CREDENTIALS__ACCESS_TOKEN"]=os.environ.get("ACCESS_TOKEN")
Loader setup guideโ
1. Initialize a project with a pipeline that loads to Databricks by running
dlt init chess databricks
2. Install the necessary dependencies for Databricks by running
pip install -r requirements.txt
This will install dlt with the databricks
extra, which contains the Databricks Python dbapi client.
3. Enter your credentials into .dlt/secrets.toml
.
This should include your connection parameters and your authentication credentials.
You can find your server hostname and HTTP path in the Databricks workspace dashboard. Go to "SQL Warehouses", select your warehouse (default is called "Starter Warehouse"), and go to "Connection details".
Example:
[destination.databricks.credentials]
server_hostname = "MY_DATABRICKS.azuredatabricks.net"
http_path = "/sql/1.0/warehouses/12345"
client_id = "XXX"
client_secret = "XXX"
catalog = "my_catalog"
You can find other options for specifying credentials in the Authentication section.
See Staging support for authentication options when dlt
copies files from buckets.
Write dispositionโ
All write dispositions are supported.
Data loadingโ
To load data into Databricks, you must set up a staging filesystem by configuring an Amazon S3 or Azure Blob Storage bucket. Parquet is the default file format used for data uploads. As an alternative to Parquet, you can switch to using JSONL.
dlt will upload the data in Parquet files (or JSONL, if configured) to the bucket and then use COPY INTO
statements to ingest the data into Databricks.
For more information on staging, see the Staging support section below.
Supported file formatsโ
- Parquet supported when staging is enabled.
- JSONL supported when staging is enabled (see limitations below).
The JSONL format has some limitations when used with Databricks:
- Compression must be disabled to load JSONL files in Databricks. Set
data_writer.disable_compression
totrue
in the dlt config when using this format. - The following data types are not supported when using the JSONL format with
databricks
:decimal
,json
,date
,binary
. Useparquet
if your data contains these types. - The
bigint
data type with precision is not supported with the JSONL format.
Staging supportโ
Databricks supports both Amazon S3, Azure Blob Storage, and Google Cloud Storage as staging locations. dlt
will upload files in Parquet format to the staging location and will instruct Databricks to load data from there.
Databricks and Amazon S3โ
Please refer to the S3 documentation for details on connecting your S3 bucket with the bucket_url
and credentials
.
Example to set up Databricks with S3 as a staging destination:
- TOML config provider
- Environment variables
- In the code
# secrets.toml
[destination.filesystem]
bucket_url = "s3://your-bucket-name"
[destination.filesystem.credentials]
aws_access_key_id="XXX"
aws_secret_access_key="XXX"
export DESTINATIONS__FILESYSTEM__BUCKET_URL="s3://your-bucket-name"
export DESTINATIONS__FILESYSTEM__CREDENTIALS__AWS_ACCESS_KEY_ID="XXX"
export DESTINATIONS__FILESYSTEM__CREDENTIALS__AWS_SECRET_ACCESS_KEY="XXX"
import os
# Do not set up the secrets directly in the code!
# What you can do is reassign env variables.
os.environ["DESTINATIONS__FILESYSTEM__BUCKET_URL"] = "s3://your-bucket-name"
os.environ["DESTINATIONS__FILESYSTEM__CREDENTIALS__AWS_ACCESS_KEY_ID"] = os.environ.get("AWS_ACCESS_KEY_ID")
os.environ["DESTINATIONS__FILESYSTEM__CREDENTIALS__AWS_SECRET_ACCESS_KEY"] = os.environ.get("AWS_SECRET_ACCESS_KEY")
Databricks and Azure Blob Storageโ
Refer to the Azure Blob Storage filesystem documentation for details on connecting your Azure Blob Storage container with the bucket_url
and credentials
.
To enable support for Azure Blob Storage with dlt, make sure to install the necessary dependencies by running:
pip install "dlt[az]"
Databricks requires that you use ABFS URLs in the following format: abfss://container_name@storage_account_name.dfs.core.windows.net/path
.
dlt is able to adapt the other representation (i.e., az://container-name/path
), but we recommend that you use the correct form.
Example to set up Databricks with Azure as a staging destination:
- TOML config provider
- Environment variables
- In the code
# secrets.toml
[destination.filesystem]
bucket_url = "abfss://container_name@storage_account_name.dfs.core.windows.net/path"
[destination.filesystem.credentials]
azure_storage_account_name="XXX"
azure_storage_account_key="XXX"
export DESTINATIONS__FILESYSTEM__BUCKET_URL="abfss://container_name@storage_account_name.dfs.core.windows.net/path"
export DESTINATIONS__FILESYSTEM__CREDENTIALS__AZURE_STORAGE_ACCOUNT_NAME="XXX"
export DESTINATIONS__FILESYSTEM__CREDENTIALS__AZURE_STORAGE_ACCOUNT_KEY="XXX"
import os
# Do not set up the secrets directly in the code!
# What you can do is reassign env variables.
os.environ["DESTINATIONS__FILESYSTEM__BUCKET_URL"] = "abfss://container_name@storage_account_name.dfs.core.windows.net/path"
os.environ["DESTINATIONS__FILESYSTEM__CREDENTIALS__AZURE_STORAGE_ACCOUNT_NAME"] = os.environ.get("AZURE_STORAGE_ACCOUNT_NAME")
os.environ["DESTINATIONS__FILESYSTEM__CREDENTIALS__AZURE_STORAGE_ACCOUNT_KEY"] = os.environ.get("AZURE_STORAGE_ACCOUNT_KEY")
Databricks and Google Cloud Storageโ
In order to load from Google Cloud Storage stage, you must set up the credentials via a named credential. See below. Databricks does not allow you to pass Google Credentials explicitly in SQL statements.
Use external locations and stored credentialsโ
dlt
forwards bucket credentials to the COPY INTO
SQL command by default. You may prefer to use external locations or stored credentials instead that are stored on the Databricks side.
If you set up an external location for your staging path, you can tell dlt
to use it:
[destination.databricks]
is_staging_external_location=true
If you set up Databricks credentials named, for example, credential_x, you can tell dlt
to use them:
[destination.databricks]
staging_credentials_name="credential_x"
Both options are available from code:
import dlt
bricks = dlt.destinations.databricks(staging_credentials_name="credential_x")
Additional destination capabilitiesโ
dbt supportโ
This destination integrates with dbt via dbt-databricks.
Syncing of dlt
stateโ
This destination fully supports dlt state sync.
Databricks user agentโ
We enable Databricks to identify that the connection is created by dlt
.
Databricks will use this user agent identifier to better understand the usage patterns associated with dlt integration. The connection identifier is dltHub_dlt
.
Additional Setup guidesโ
- Load data from Crypt API to Databricks in python with dlt
- Load data from Notion to Databricks in python with dlt
- Load data from Qualtrics to Databricks in python with dlt
- Load data from IBM Db2 to Databricks in python with dlt
- Load data from Capsule CRM to Databricks in python with dlt
- Load data from MySQL to Databricks in python with dlt
- Load data from Google Cloud Storage to Databricks in python with dlt
- Load data from Azure Cloud Storage to Databricks in python with dlt
- Load data from Fivetran to Databricks in python with dlt
- Load data from Vimeo to Databricks in python with dlt