
Project tutorial

dlt+

This page is for dlt+, which requires a license. Join our early access program for a trial license.

This tutorial introduces you to dlt+ Project and the essential CLI commands needed to create and manage it. You will learn how to:

  • initialize a new dlt+ Project
  • navigate the dlt.yml file
  • add sources, destinations, and pipelines
  • run pipelines using CLI commands
  • inspect datasets
  • work with dlt+ Profiles to enable different configurations

Prerequisites

To follow this tutorial, make sure you have dlt+ installed and a valid license available.

tip

You can find the full list of available CLI commands in the CLI reference.

Creating a new dlt+ Project

Start by creating a new folder for your project. Then, navigate to the folder in your terminal.

mkdir tutorial && cd tutorial

Run the following command to initialize a new dlt+ Project:

# Initialize a dlt+ Project named "tutorial"; the name is derived from the folder name
dlt project init arrow duckdb

This command generates a project named tutorial with:

  • one pipeline
  • one Arrow source defined in sources/arrow.py
  • one DuckDB destination
  • one dataset on the DuckDB destination

caution

Currently, dlt project init only supports a limited number of sources (for example, REST API, SQL database, and filesystem). To list all available sources, use the CLI command:

dlt source list-available

Support for other verified sources is coming soon.

The generated folder structure

After running the command, the following folder structure is created:

.
├── .dlt/               # your dlt settings, including profile settings
│   ├── config.toml
│   ├── dev.secrets.toml
│   └── secrets.toml
├── _data/              # local storage for your project, excluded from git
├── sources/            # your sources, contains the code for the arrow source
│   └── arrow.py
├── .gitignore
├── requirements.txt
└── dlt.yml             # the main project manifest

Understanding dlt.yml

The dlt.yml file is the central configuration for your dlt+ Project. It defines the pipelines, sources, and destinations. In the generated project, the file looks like this:

profiles:
  # profiles allow you to configure different settings for different environments
  dev: {}

# your sources are the data sources you want to load from
sources:
  arrow:
    type: sources.arrow.source

# your destinations are the databases where your data will be saved
destinations:
  duckdb:
    type: duckdb

# your datasets are the datasets on your destinations where your data will go
datasets: {}

# your pipelines orchestrate data loading actions
pipelines:
  my_pipeline:
    source: arrow
    destination: duckdb
    dataset_name: my_pipeline_dataset

tip

If you do not want to start with a source, destination, and pipeline, you can simply run dlt project init --project-name tutorial. This will generate a project with empty sources, destinations, and pipelines.

Some details about the project structure above:

  • The runtime section (if present) is analogous to the [runtime] section in config.toml and could also be omitted in this case.
  • The profiles section is not doing much in this case. There are two implicit profiles, dev and tests, that are present in every project; we will learn about profiles in more detail later.

You can reference environment variables in the dlt.yml file using the {env.ENV_VARIABLE_NAME} syntax. Additionally, dlt+ provides several predefined project variables that are automatically substituted during loading.
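
For example, you could point the duckdb destination at a file path taken from the environment. This is a minimal sketch; DUCKDB_FILE_PATH is a hypothetical environment variable used only for illustration:

destinations:
  duckdb:
    type: duckdb
    # DUCKDB_FILE_PATH is a hypothetical environment variable, set it before running the pipeline
    credentials: "{env.DUCKDB_FILE_PATH}"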

tip

You can find more information about the dlt.yml structure in the dlt+ Project section.

Running the pipeline

Once the project is initialized, you can run the pipeline using:

dlt pipeline my_pipeline run

This command:

  • Locates the pipeline named my_pipeline in dlt.yml.
  • Executes it, populating the DuckDB destination, which is configured to store its data in _data/dev/local/duckdb.duckdb.

tip

Take a look at the Projects context to learn more about how to work with nested projects and how dlt searches for a pipeline based on its name.

Inspecting the results

Use the dlt dataset command to interact with the dataset stored in the DuckDB destination. For example:

Counting the loaded rows

To count rows in the dataset, run:

dlt dataset my_pipeline_dataset row-counts

This will show the number of rows in the items table as specified by the arrow source. Additionally, the internal dlt tables are shown.

             table_name  row_count
0                 items        100
1          _dlt_version          1
2            _dlt_loads          1
3   _dlt_pipeline_state          1

View data

To view the first five rows of the items table:

dlt dataset my_pipeline_dataset head items

This displays the top entries in the items table, enabling quick validation of the pipeline's output. The output will be something like this:

Loading first 5 rows of table items.

   id   name  age
0   0  jerry   49
1   1    jim   25
2   2   jane   46
3   3   john   48
4   4  jenny   49

To show more rows, use the --limit flag.

dlt dataset my_pipeline_dataset head items --limit 50

Adding sources, destinations, and pipelines to your project

Adding a new entity to an existing dlt+ Project is easy. Simply run the following command:

dlt <entity_type> <entity_name> add

Depending on the entity you are adding, different options are available. To explore all commands, refer to the CLI command reference. You can also use the --help option to see the available settings for a specific entity, for example: dlt destination add --help. Let's individually add a source, destination, and pipeline to a new project, replicating the default project we created in the previous chapter.

Create an empty project

Delete all the files in the tutorial folder and run the following command to create an empty project:

dlt project init

This will create a project without any sources, destinations, datasets, or pipelines; the project will be named after the folder.
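
Based on the manifest structure shown earlier, the generated dlt.yml should now be essentially empty and look roughly like this (a sketch; your generated file may contain additional comments):

profiles:
  dev: {}

sources: {}

destinations: {}

datasets: {}

pipelines: {}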

Add all entities

Now we can add all of our entities individually. This way, we can also give them their own names, which is useful when, for example, you have multiple destinations of the same type.

Add a source with:

# add a new arrow source called "my_arrow_source"
dlt source my_arrow_source add arrow

Add a destination:

# add a new duckdb destination called "my_duckdb_destination"
# this will also create a new dataset called "my_duckdb_destination_dataset"
dlt destination my_duckdb_destination add duckdb

Now we can add a pipeline that uses the source and destination we just added:

# add a new pipeline called "my_pipeline" which loads from my_arrow_source and saves to my_duckdb_destination
# optionally, the my_duckdb_destination_dataset can be selected with a flag (see dlt pipeline add --help)
dlt pipeline my_pipeline add my_arrow_source my_duckdb_destination
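
After these three commands, the pipelines section of dlt.yml should contain an entry along these lines (a sketch based on the manifest format shown earlier; the exact generated content may differ slightly):

pipelines:
  my_pipeline:
    source: my_arrow_source
    destination: my_duckdb_destination
    dataset_name: my_duckdb_destination_dataset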

Adding a core source

You can add multiple entities using CLI commands. Let's add another source; this time, a core source (for example, REST API, SQL database, or filesystem).

Run the following command to add an SQL database source named sql_db_1:

# add a new sql_database source called "sql_db_1"
dlt source sql_db_1 add sql_database

This will add the new source to your dlt.yml file:

sources:
  arrow:
    type: sources.arrow.source

  sql_db_1:
    type: sql_database

The corresponding credential placeholders will be added to .dlt/secrets.toml, but you can also define them in dlt.yml.

[sources.sql_db_1]
table_names = ["family", "clan"]

[sources.sql_db_1.credentials]
drivername = "mysql+pymysql"
database = "Rfam"
username = "rfamro"
host = "mysql-rfam-public.ebi.ac.uk"
port = 4497
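
If you prefer to keep this configuration in dlt.yml instead of the secrets file, the same settings could be expressed roughly as follows (a sketch; the exact nesting of source configuration inside dlt.yml may differ, and secrets are usually better kept out of the manifest):

sources:
  sql_db_1:
    type: sql_database
    table_names: ["family", "clan"]
    credentials:
      drivername: mysql+pymysql
      database: Rfam
      username: rfamro
      host: mysql-rfam-public.ebi.ac.uk
      port: 4497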

Configuration and profiles

dlt+ introduces a new core concept, Profiles, which provide a way to manage different configurations for different environments. Let's have a look at our example project. The profiles section currently looks like this:

profiles:
  dev: {}

This means the dev profile is empty and, by default, all settings are inherited from the project configuration. We can inspect the current state of the project configuration by running:

dlt project --profile dev config show

This will show the current state of the project configuration with the dev profile loaded. If you don't specify the --profile option, the dev profile is used by default.

Adding a new profile

We can now create a new profile called prod that changes the location of the duckdb file we are loading to, as well as the log level of the project and the number of rows we are loading. Please run:

dlt profile prod add

And change the prod profile to the following:

  prod:
    sources:
      my_arrow_source:
        row_count: 200
    runtime:
      log_level: INFO
    destinations:
      my_duckdb_destination:
        credentials: my_data_prod.duckdb

We can now inspect the prod profile. You will see that the new settings are merged with the project configuration and the dev profile settings.

dlt project --profile prod config show

Run a pipeline with the new profile and inspect the results

Now, let's run the pipeline with the prod profile.

dlt pipeline --profile prod my_pipeline run

You can now see more output in the console due to the more verbose log level, and the number of rows loaded is now 200 instead of 100. Let's inspect our datasets for each profile (assuming you still have the duckdb database file from the previous chapter).

dlt dataset --profile dev my_duckdb_destination_dataset row-counts
dlt dataset --profile prod my_duckdb_destination_dataset row-counts

You will see that the number of rows loaded is now 200 instead of 100 in the prod profile.

tip

Profiles can also be inherited from other profiles; you can find more information in Profiles.

Using config files with profiles

You can also use the familiar configuration and secrets TOML files, as well as environment variables, together with profiles. You have probably noticed that your project contains more than one secrets file, with the profile name prepended. These secrets files are only loaded when the corresponding profile is active. To demonstrate this, let's move the duckdb credentials, runtime settings, and source settings from dlt.yml to the TOML files:

First, remove all the content of the prod section in the dlt.yml file, keeping only the (now empty) prod key; keep the empty prod secrets file as well. Also remove the runtime section from dlt.yml, the credentials key from the destination, and the row_count key from the sources.my_arrow_source section. If you try to run the pipeline now, dlt will complain about missing configuration values:

dlt pipeline my_pipeline run
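
For reference, after these removals the relevant parts of dlt.yml should look roughly like this (a sketch assembled from the entities added earlier; the exact type path of the arrow source may differ in your project):

profiles:
  dev: {}
  prod: {}

sources:
  my_arrow_source:
    type: sources.arrow.source

destinations:
  my_duckdb_destination:
    type: duckdb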

Now let's add the following to the dev.secrets.toml file:

[runtime]
log_level = "WARNING"

[destination.my_duckdb_destination]
credentials = "my_data.duckdb"

[sources.my_arrow_source]
row_count = 100

And the following to the prod.secrets.toml file:

[runtime]
log_level = "INFO"

[destination.my_duckdb_destination]
credentials = "my_data_prod.duckdb"

[sources.my_arrow_source]
row_count = 200

We can now clear the _data directory and repeat the steps above, running both pipelines and inspecting both datasets; you will see that the settings from the TOML files are applied:

Load some data:

dlt pipeline --profile dev my_pipeline run
dlt pipeline --profile prod my_pipeline run

Inspect the datasets:

dlt dataset --profile dev my_duckdb_destination_dataset row-counts
dlt dataset --profile prod my_duckdb_destination_dataset row-counts

To locate your loaded data, check the _data/{profile name}/local directory.
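
Assuming the duckdb file names used in this tutorial (my_data.duckdb and my_data_prod.duckdb), the layout will look roughly like this:

_data/
├── dev/
│   └── local/
│       └── my_data.duckdb
└── prod/
    └── local/
        └── my_data_prod.duckdb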
