
What’s new in dlt for Databricks: built-in staging, zero-config notebooks, no headaches

Aman Gupta, Jr. Data Engineer

Data ingestion to Databricks should be simple, but it rarely is.

If you've ever burned hours wiring up clusters, fiddling with IAM roles, and manually configuring external staging just to load a few gigabytes of data, you're not alone.

And if you've tried ingesting from APIs or external databases, you know the drill: Spark waits on HTTP while you babysit flaky endpoints and patch I/O with duct tape. It's like using a firehose to fill a glass.

We can’t kill the cluster (yet), but we can stop you from babysitting it. Spark’s scalability is still there when you need it, without the endless boilerplate for routine ingestion tasks.

Latest improvements to the Databricks destination

The Databricks destination in dlt has been upgraded to reduce setup complexity and eliminate common integration pain points. The changes focus on three key areas: staging, notebook support, and environment configuration.
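To make the staging change concrete, here is a rough before-and-after sketch. It is illustrative rather than exact: the "before" pipeline assumes you had paired the Databricks destination with a filesystem staging destination backed by your own bucket, which is no longer required, and the pipeline and dataset names are placeholders.

import dlt

# Before: loads to Databricks typically went through an external staging
# bucket (S3 / ADLS / GCS) that you provisioned and credentialed yourself.
old_pipeline = dlt.pipeline(
    pipeline_name="github_issues",
    destination="databricks",
    staging="filesystem",  # bucket URL and credentials configured separately
    dataset_name="rest_api_data",
)

# Now: the Databricks destination handles staging itself, so the pipeline
# declaration is just the destination.
new_pipeline = dlt.pipeline(
    pipeline_name="github_issues",
    destination="databricks",
    dataset_name="rest_api_data",
)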

Technical demo: ingest GitHub issues into Databricks using dlt + the REST API, no config required

Here’s what it takes to go from zero to data in Databricks with dlt: about 30 lines of Python. No YAML. No CLI. No config spelunking.

What you'll build

A Python-only data ingestion pipeline that:

  • Uses GitHub’s REST API to extract open issues
  • Tracks updates incrementally using a cursor
  • Writes directly into Databricks with Unity Catalog support

Setup instructions

1. Create and configure your Databricks environment

  • Spin up a Databricks workspace on any cloud provider.
  • Ensure the workspace is integrated with Unity Catalog.
  • Use the provided init.sh script (shared via Notion) to configure the cluster for dlt; paste it into the cluster’s init script configuration.

2. Create a Databricks notebook

In your Databricks workspace:

  • Create a new notebook (Python).
  • Place it in the directory where you’ll run your pipeline code.

3. Install the required packages

In a notebook cell, run:

%pip install "dlt[databricks]"

This installs the dlt library with Databricks support.
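If you want to confirm the install took effect before moving on, a quick sanity check is to import the package and print its version in a new cell:

import dlt

# Confirm the package imports and show which dlt version the notebook is running
print(dlt.__version__)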

Pipeline code: ingest GitHub issues via the REST API

Paste the following into a new cell and run:

import dlt
from dlt.common.pendulum import pendulum
from dlt.destinations import databricks
from dlt.sources.rest_api import rest_api_source

# --- Databricks destination setup --------------------------------------------

bricks = databricks(credentials={"catalog": "dltcheck"})  # Replace with your actual catalog

# --- GitHub REST API source configuration ------------------------------------

github_source = rest_api_source(
    {
        "client": {"base_url": "https://api.github.com/repos/dlt-hub/dlt/"},
        "resource_defaults": {
            "primary_key": "id",
            "write_disposition": "merge",
            "endpoint": {"params": {"per_page": 100}},  # GitHub's max page size
        },
        "resources": [
            {
                "name": "issues",
                "endpoint": {
                    "path": "issues",
                    "params": {
                        "state": "open",
                        "since": "{incremental.start_value}",
                    },
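                    # dlt remembers the highest "updated_at" value seen and
                    # injects it as the "since" param on the next run, so only
                    # new or updated issues are fetched.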
                    "incremental": {
                        "cursor_path": "updated_at",
                        "initial_value": pendulum.today()
                        .subtract(days=30)
                        .to_iso8601_string(),
                    },
                },
            },
        ],
    }
)

# --- Run the pipeline --------------------------------------------------------

pipeline = dlt.pipeline(
    pipeline_name="github_rest_api_example",
    dataset_name="rest_api_data",
    destination=bricks,
    progress="log",
)

load_info = pipeline.run(github_source)

print(load_info)

# Optional: View the issues as a DataFrame
print(pipeline.dataset().issues.df())
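If you later run this pipeline outside a Databricks notebook (for example, from a scheduled job on another machine), the notebook’s zero-config credential pickup no longer applies, and you would pass the workspace connection explicitly. A minimal sketch, assuming token-based authentication; every value shown is a placeholder:

from dlt.destinations import databricks

# Running outside a notebook: supply the workspace connection explicitly
# instead of relying on the notebook context. All values are placeholders.
bricks = databricks(
    credentials={
        "server_hostname": "<your-workspace>.cloud.databricks.com",
        "http_path": "/sql/1.0/warehouses/<warehouse-id>",
        "access_token": "<personal-access-token>",
        "catalog": "dltcheck",
    }
)

Equivalently, these values can live in secrets.toml or environment variables rather than in code.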

Summary

You now have a minimal GitHub-to-Databricks pipeline using only:

  • Python
  • REST API integration
  • Native Databricks support via dlt

No project scaffolding, no config files, no dlt init needed.

Wrapping up

Ingesting data into Databricks used to be a tedious side quest no one asked for. With dlt’s built-in staging and zero-config notebooks, things finally just work.

No duct tape, no config-file archaeology, no babysitting idle clusters.

Moving data shouldn’t be a heroic act. With dlt, it’s just Python.

Bringing dlt into your Databricks stack? Let’s get you what you need.

Whether you're kicking the tires or heading to prod, dlt's got you covered.