Project tutorial
This page is for dlt+, which requires a license. Join our early access program for a trial license.
This tutorial introduces you to dlt+ Project and the essential CLI commands needed to create and manage it. You will learn how to:
- initialize a new dlt+ Project
- navigate the dlt.yml file
- add sources, destinations, and pipelines
- run pipelines using CLI commands
- inspect datasets
- work with dlt+ Profiles to enable different configurations
Prerequisites
To follow this tutorial, make sure:
- dlt+ is set up according to the installation guide
- you're familiar with the core concepts of dlt
You can find the full list of available CLI commands in the CLI reference.
Creating a new dlt+ Project
Start by creating a new folder for your project. Then, navigate to the folder in your terminal.
mkdir tutorial && cd tutorial
Run the following command to initialize a new dlt+ Project:
# initialize a dlt+ Project named "tutorial"; the name is derived from the folder name
dlt project init arrow duckdb
This command generates a project named tutorial with:
- one pipeline
- one Arrow source defined in sources/arrow.py
- one DuckDB destination
- one dataset on the DuckDB destination
Currently, dlt project init only supports a limited number of sources (for example, REST API, SQL database, and filesystem). To list all available sources, use the CLI command:
dlt source list-available
Support for other verified sources is coming soon!
The generated folder structure
After running the command, the following folder structure is created:
.
├── .dlt/              # your dlt settings including profile settings
│   ├── config.toml
│   ├── dev.secrets.toml
│   └── secrets.toml
├── _data/             # local storage for your project, excluded from git
├── sources/           # your sources, contains the code for the arrow source
│   └── arrow.py
├── .gitignore
├── requirements.txt
└── dlt.yml            # the main project manifest
Understanding dlt.yml
The dlt.yml file is the central configuration for your dlt+ Project. It defines the pipelines, sources, and destinations. In the generated project, the file looks like this:
profiles:
  # profiles allow you to configure different settings for different environments
  dev: {}

# your sources are the data sources you want to load from
sources:
  arrow:
    type: sources.arrow.source

# your destinations are the databases where your data will be saved
destinations:
  duckdb:
    type: duckdb

# your datasets are the datasets on your destinations where your data will go
datasets: {}

# your pipelines orchestrate data loading actions
pipelines:
  my_pipeline:
    source: arrow
    destination: duckdb
    dataset_name: my_pipeline_dataset
If you do not want to start with a source, destination, and pipeline, you can simply run dlt project init --project-name tutorial. This will generate a project with empty sources, destinations, and pipelines.
Some details about the project structure above:
- The runtime section is analogous to the [runtime] section of config.toml and could also be omitted in this case (see the sketch after this list).
- The profiles section is not doing much in this case. There are two implicit profiles, dev and tests, that are present in any project; we will learn about profiles in more detail later.
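For reference, a minimal runtime section in dlt.yml could look like the sketch below; the log_level value is only an illustration and mirrors the [runtime] table you would otherwise keep in config.toml:
runtime:
  # same effect as log_level under [runtime] in config.toml (illustrative value)
  log_level: WARNING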
You can reference environment variables in the dlt.yml file using the {env.ENV_VARIABLE_NAME} syntax. Additionally, dlt+ provides several predefined project variables that are automatically substituted during loading.
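For example, a destination's credentials could be read from an environment variable; the variable name DUCKDB_PATH below is purely illustrative:
destinations:
  duckdb:
    type: duckdb
    # the value of the DUCKDB_PATH environment variable is substituted here during loading
    credentials: "{env.DUCKDB_PATH}"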
You can find more information about the dlt.yml structure in the dlt+ Project section.
Running the pipeline
Once the project is initialized, you can run the pipeline using:
dlt pipeline my_pipeline run
This command:
- Locates the pipeline named my_pipeline in dlt.yml.
- Executes it, populating the DuckDB destination, which is stored in _data/dev/local/duckdb.duckdb.
Take a look at the Projects context to learn more about how to work with nested projects and how dlt looks up pipelines by name.
Inspecting the results
Use the dlt dataset command to interact with the dataset stored in the DuckDB destination. For example:
Counting the loaded rows
To count rows in the dataset, run:
dlt dataset my_pipeline_dataset row-counts
This shows the number of rows in the items table, as specified by the arrow source, along with the internal dlt tables:
             table_name  row_count
0                 items        100
1          _dlt_version          1
2            _dlt_loads          1
3   _dlt_pipeline_state          1
View data
To view the first five rows of the items table:
dlt dataset my_pipeline_dataset head items
This displays the top entries in the items table, enabling quick validation of the pipeline's output. The output will look something like this:
Loading first 5 rows of table items.
   id   name  age
0   0  jerry   49
1   1    jim   25
2   2   jane   46
3   3   john   48
4   4  jenny   49
To show more rows, use the --limit flag:
dlt dataset my_pipeline_dataset head items --limit 50
Adding sources, destinations, and pipelines to your project
Adding a new entity to an existing dlt+ Project is easy; just run:
dlt <entity_type> <entity_name> add
Depending on the entity you are adding, different options are available.
To explore all commands, refer to the CLI command reference. You can also use the --help option to see available settings for a specific entity, for example dlt destination add --help. Let's individually add a source, destination, and pipeline to a new project, replicating the default project we created in the previous chapter.
Create an empty project
Delete all the files in the tutorial folder and run the following command to create an empty project:
dlt project init
This will create a project without any sources, destinations, datasets, or pipelines; the project will be named after the folder.
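The generated dlt.yml will then contain only empty sections, roughly along the lines of the sketch below (the exact representation of the empty entries may differ):
profiles:
  dev: {}
sources: {}
destinations: {}
datasets: {}
pipelines: {}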
Add all entities
Now we can add all of our entities individually. This way, we can also give them their own names, which is useful when you have, for example, multiple destinations of the same type.
Add a source with:
# add a new arrow source called "my_arrow_source"
dlt source my_arrow_source add arrow
Add a destination:
# add a new duckdb destination called "my_duckdb_destination"
# this will also create a new dataset called "my_duckdb_destination_dataset"
dlt destination my_duckdb_destination add duckdb
Now we can add a pipeline that uses the source and destination we just added:
# add a new pipeline called "my_pipeline" which loads from my_arrow_source and saves to my_duckdb_destination
# the my_duckdb_destination_dataset created above is used as the pipeline's dataset; it can also be selected explicitly with an optional flag
dlt pipeline my_pipeline add my_arrow_source my_duckdb_destination
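After these three commands, dlt.yml should contain entries for the new destination, dataset, and pipeline roughly like the sketch below; the exact shape of the generated entries (in particular the dataset entry and the source's type path, which is omitted here) may differ:
destinations:
  my_duckdb_destination:
    type: duckdb

datasets:
  my_duckdb_destination_dataset:
    # the dataset created alongside the destination; assumed to reference it like this
    destination:
      - my_duckdb_destination

pipelines:
  my_pipeline:
    source: my_arrow_source
    destination: my_duckdb_destination
    dataset_name: my_duckdb_destination_dataset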
Adding a core source
You can add multiple entities using CLI commands. Let's add another source - this time, a core source (REST API, SQL database, or filesystem).
Run the following command to add an SQL database source named sql_db_1:
# add a new sql_database source called "sql_db_1"
dlt source sql_db_1 add sql_database
This will add the new source to your dlt.yml file:
sources:
  arrow:
    type: sources.arrow.source
  sql_db_1:
    type: sql_database
The corresponding credential placeholders will be added to .dlt/secrets.toml, but you can also define them in dlt.yml:
[sources.sql_db_1]
table_names = ["family", "clan"]
[sources.sql_db_1.credentials]
drivername = "mysql+pymysql"
database = "Rfam"
username = "rfamro"
host = "mysql-rfam-public.ebi.ac.uk"
port = 4497
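For instance, the same settings could sit directly under the source entry in dlt.yml; this is only a sketch, and the exact nesting dlt+ expects may differ:
sources:
  sql_db_1:
    type: sql_database
    table_names: ["family", "clan"]
    credentials:
      drivername: mysql+pymysql
      database: Rfam
      username: rfamro
      host: mysql-rfam-public.ebi.ac.uk
      port: 4497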
Configuration and profiles
dlt+ introduces a new core concept, Profiles, which provide a way to manage different configurations for different environments. Let's take a look at our example project. The profiles section currently looks like this:
profiles:
  dev: {}
This means the dev profile is empty; by default, all settings are inherited from the project configuration. We can inspect the current state of the project configuration by running:
dlt project --profile dev config show
This will show the current state of the project configuration with the dev profile loaded. If you don't specify the --profile option, the dev profile is used by default.
Adding a new profile
We can now create a new profile called prod that changes the location of the DuckDB file we are loading to, as well as the project's log level and the number of rows we are loading. Run:
dlt profile prod add
Then change the prod profile to the following (this block is nested under the top-level profiles: key in dlt.yml):
prod:
  sources:
    my_arrow_source:
      row_count: 200
  runtime:
    log_level: INFO
  destinations:
    my_duckdb_destination:
      credentials: my_data_prod.duckdb
We can now inspect the prod profile. You will see that the new settings are merged with the project configuration and the dev profile settings.
dlt project --profile prod config show
Run a pipeline with the new profile and inspect the results
Now, let's run the pipeline with the prod profile:
dlt pipeline --profile prod my_pipeline run
You can now see more output in the console due to the more verbose log level, and the number of rows loaded is now 200 instead of 100. Let's inspect our datasets for each profile (assuming you still have the duckdb database file from the previous chapter).
dlt dataset --profile dev my_duckdb_destination_dataset row-counts
dlt dataset --profile prod my_duckdb_destination_dataset row-counts
You will see that the number of rows loaded is now 200 instead of 100 in the prod profile.
Profiles can also be inherited from other profiles; you can find more information in Profiles.
Using config files with profiles
You can also use the familiar configuration and secrets TOML files and environment variables. You have probably noticed that your project contains more than one secrets file, with the profile name prepended. These profile-specific secrets files are only loaded when the given profile is active. To demonstrate this, let's move the DuckDB credentials, runtime settings, and source settings out of the dlt.yml file and into the TOML files.
First, remove all the content of the prod section in the dlt.yml file, but keep the prod: key itself and the empty prod.secrets.toml file. Also remove the runtime section from dlt.yml, as well as the credentials key from the destination and the row_count key from the sources.my_arrow_source section (a sketch of the result follows below).
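After this cleanup, the affected parts of dlt.yml should look roughly like the sketch below (only the keys discussed above are shown; the source and pipeline entries generated earlier stay as they are, minus the removed row_count):
profiles:
  dev: {}
  prod: {}

destinations:
  my_duckdb_destination:
    # the credentials key has been removed
    type: duckdb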
If you try to run the pipeline now, dlt will complain about missing configuration values:
dlt pipeline my_pipeline run
Now let's add the following to the dev.secrets.toml file:
[runtime]
log_level = "WARNING"
[destination.my_duckdb_destination]
credentials = "my_data.duckdb"
[sources.my_arrow_source]
row_count = 100
And the following to the prod.secrets.toml file:
[runtime]
log_level = "INFO"
[destination.my_duckdb_destination]
credentials = "my_data_prod.duckdb"
[sources.my_arrow_source]
row_count = 200
We can now clear the _data directory and repeat the steps above, running both pipelines and inspecting both datasets; you will see that the settings from the TOML files are applied:
Load some data:
dlt pipeline --profile dev my_pipeline run
dlt pipeline --profile prod my_pipeline run
Inspect the datasets:
dlt dataset --profile dev my_duckdb_destination_dataset row-counts
dlt dataset --profile prod my_duckdb_destination_dataset row-counts
To locate your loaded data, check the _data/{profile name}/local directory.
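Assuming the duckdb credentials above are resolved relative to each profile's local directory, the layout will look roughly like this:
_data/
├── dev/
│   └── local/
│       └── my_data.duckdb
└── prod/
    └── local/
        └── my_data_prod.duckdb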