Skip to main content

Databricks

Big thanks to Evan Phillips and swishbi.com for contributing code, time, and a test environment.

Install dlt with Databricksโ€‹

To install the dlt library with Databricks dependencies:

pip install "dlt[databricks]"

Set up your Databricks workspaceโ€‹

To use the Databricks destination, you need:

  • A Databricks workspace with a Unity Catalog metastore connected
  • A Gen 2 Azure storage account and container

If you already have your Databricks workspace set up, you can skip to the Loader setup Guide.

1. Create a Databricks workspace in Azureโ€‹

  1. Create a Databricks workspace in Azure

    In your Azure Portal, search for Databricks and create a new workspace. In the "Pricing Tier" section, select "Premium" to be able to use the Unity Catalog.

  2. Create an ADLS Gen 2 storage account

    Search for "Storage accounts" in the Azure Portal and create a new storage account. Make sure it's a Data Lake Storage Gen 2 account, you do this by enabling "hierarchical namespace" when creating the account. Refer to the Azure documentation for further info.

  3. Create a container in the storage account

    In the storage account, create a new container. This will be used as a datastore for your Databricks catalog.

  4. Create an Access Connector for Azure Databricks

    This will allow Databricks to access your storage account. In the Azure Portal, search for "Access Connector for Azure Databricks" and create a new connector.

  5. Grant access to your storage container

    Navigate to the storage container you created before and select "Access control (IAM)" in the left-hand menu.

    Add a new role assignment and select "Storage Blob Data Contributor" as the role. Under "Members" select "Managed Identity" and add the Databricks Access Connector you created in the previous step.

2. Set up a metastore and Unity Catalog and get your access tokenโ€‹

  1. Now go to your Databricks workspace

    To get there from the Azure Portal, search for "Databricks", select your Databricks, and click "Launch Workspace".

  2. In the top right corner, click on your email address and go to "Manage Account"

  3. Go to "Data" and click on "Create Metastore"

    Name your metastore and select a region. If you'd like to set up a storage container for the whole metastore, you can add your ADLS URL and Access Connector Id here. You can also do this on a granular level when creating the catalog.

    In the next step, assign your metastore to your workspace.

  4. Go back to your workspace and click on "Catalog" in the left-hand menu

  5. Click "+ Add" and select "Add Storage Credential"

    Create a name and paste in the resource ID of the Databricks Access Connector from the Azure portal. It will look something like this: /subscriptions/<subscription_id>/resourceGroups/<resource_group>/providers/Microsoft.Databricks/accessConnectors/<connector_name>

  1. Click "+ Add" again and select "Add external location"

    Set the URL of our storage container. This should be in the form: abfss://<container_name>@<storage_account_name>.dfs.core.windows.net/<path>

    Once created, you can test the connection to make sure the container is accessible from Databricks.

  2. Now you can create a catalog

    Go to "Catalog" and click "Create Catalog". Name your catalog and select the storage location you created in the previous step.

  3. Create your access token

    Click your email in the top right corner and go to "User Settings". Go to "Developer" -> "Access Tokens". Generate a new token and save it. You will use it in your dlt configuration.

Loader setup Guideโ€‹

1. Initialize a project with a pipeline that loads to Databricks by running

dlt init chess databricks

2. Install the necessary dependencies for Databricks by running

pip install -r requirements.txt

This will install dlt with databricks extra which contains Databricks Python dbapi client.

4. Enter your credentials into .dlt/secrets.toml.

This should have your connection parameters and your personal access token.

You will find your server hostname and HTTP path in the Databricks workspace dashboard. Go to "SQL Warehouses", select your warehouse (default is called "Starter Warehouse") and go to "Connection details".

Example:

[destination.databricks.credentials]
server_hostname = "MY_DATABRICKS.azuredatabricks.net"
http_path = "/sql/1.0/warehouses/12345"
access_token = "MY_ACCESS_TOKEN"
catalog = "my_catalog"

Write dispositionโ€‹

All write dispositions are supported

Data loadingโ€‹

Data is loaded using INSERT VALUES statements by default.

Efficient loading from a staging filesystem is also supported by configuring an Amazon S3 or Azure Blob Storage bucket as a staging destination. When staging is enabled, dlt will upload data in parquet files to the bucket and then use COPY INTO statements to ingest the data into Databricks. For more information on staging, see the staging support section below.

Supported file formatsโ€‹

  • insert-values is used by default
  • jsonl supported when staging is enabled (see limitations below)
  • parquet supported when staging is enabled

The jsonl format has some limitations when used with Databricks:

  1. Compression must be disabled to load jsonl files in Databricks. Set data_writer.disable_compression to true in dlt config when using this format.
  2. The following data types are not supported when using jsonl format with databricks: decimal, complex, date, binary. Use parquet if your data contains these types.
  3. bigint data type with precision is not supported with jsonl format

Staging supportโ€‹

Databricks supports both Amazon S3 and Azure Blob Storage as staging locations. dlt will upload files in parquet format to the staging location and will instruct Databricks to load data from there.

Databricks and Amazon S3โ€‹

Please refer to the S3 documentation for details on connecting your S3 bucket with the bucket_url and credentials.

Example to set up Databricks with S3 as a staging destination:

import dlt

# Create a dlt pipeline that will load
# chess player data to the Databricks destination
# via staging on S3
pipeline = dlt.pipeline(
pipeline_name='chess_pipeline',
destination='databricks',
staging=dlt.destinations.filesystem('s3://your-bucket-name'), # add this to activate the staging location
dataset_name='player_data',
)

Databricks and Azure Blob Storageโ€‹

Refer to the Azure Blob Storage filesystem documentation for details on connecting your Azure Blob Storage container with the bucket_url and credentials.

Example to set up Databricks with Azure as a staging destination:

# Create a dlt pipeline that will load
# chess player data to the Databricks destination
# via staging on Azure Blob Storage
pipeline = dlt.pipeline(
pipeline_name='chess_pipeline',
destination='databricks',
staging=dlt.destinations.filesystem('az://your-container-name'), # add this to activate the staging location
dataset_name='player_data'
)

dbt supportโ€‹

This destination integrates with dbt via dbt-databricks

Syncing of dlt stateโ€‹

This destination fully supports dlt state sync.

Additional Setup guidesโ€‹

This demo works on codespaces. Codespaces is a development environment available for free to anyone with a Github account. You'll be asked to fork the demo repository and from there the README guides you with further steps.
The demo uses the Continue VSCode extension.

Off to codespaces!

DHelp

Ask a question

Welcome to "Codex Central", your next-gen help center, driven by OpenAI's GPT-4 model. It's more than just a forum or a FAQ hub โ€“ it's a dynamic knowledge base where coders can find AI-assisted solutions to their pressing problems. With GPT-4's powerful comprehension and predictive abilities, Codex Central provides instantaneous issue resolution, insightful debugging, and personalized guidance. Get your code running smoothly with the unparalleled support at Codex Central - coding help reimagined with AI prowess.