1. Initialize the dlt project
Let's start by initializing a new dlt project as follows:
```sh
dlt init chess redshift
```
💡 This command will initialize your pipeline with chess as the source and Redshift as the destination.
The above command generates several files and directories, including `.dlt/secrets.toml` and a requirements file for Redshift. You can install the necessary dependencies specified in the requirements file by executing it as follows:
```sh
pip install -r requirements.txt
```
or with `pip install "dlt[redshift]"`, which installs the `dlt` library and the necessary dependencies for working with Amazon Redshift as a destination.
2. Set up a Redshift cluster
To load data into Redshift, you need to create a Redshift cluster and enable access to your IP address through the VPC inbound rules associated with the cluster. While we recommend asking our GPT-4 assistant for details, we have provided a general outline of the process below, followed by a scripted sketch:
- You can use an existing cluster or create a new one.
- To create a new cluster, navigate to the 'Provisioned Cluster Dashboard' and click 'Create Cluster'.
- Specify the required details such as 'Cluster Identifier', 'Node Type', 'Admin User Name', 'Admin Password', and 'Database Name'.
- In the 'Network and Security' section, you can configure the cluster's VPC (Virtual Private Cloud). Remember to add your IP address to the inbound rules of the VPC on AWS.
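If you prefer to script these steps instead of using the console, the sketch below shows one way to do it with `boto3`. The cluster identifier, node type, credentials, security group ID, and IP address are all hypothetical placeholders, not values from this guide:
```py
import boto3

# A sketch of steps 2-4 using boto3; all identifiers below are
# hypothetical placeholders - adjust them to your AWS account.
redshift = boto3.client("redshift", region_name="us-east-1")
redshift.create_cluster(
    ClusterIdentifier="dlt-chess-cluster",   # 'Cluster Identifier'
    ClusterType="single-node",
    NodeType="dc2.large",                    # 'Node Type'
    MasterUsername="admin",                  # 'Admin User Name'
    MasterUserPassword="ChangeMe123",        # 'Admin Password'
    DBName="dlt_data",                       # 'Database Name'
    PubliclyAccessible=True,
)

# Add your IP address to the inbound rules of the security group
# attached to the cluster's VPC so dlt can connect on port 5439.
ec2 = boto3.client("ec2", region_name="us-east-1")
ec2.authorize_security_group_ingress(
    GroupId="sg-0123456789abcdef0",  # the cluster's security group
    IpProtocol="tcp",
    FromPort=5439,
    ToPort=5439,
    CidrIp="203.0.113.7/32",         # your IP address
)
```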
3. Add credentials
Next, set up the Redshift credentials in the `.dlt/secrets.toml` file as shown below:
```toml
[destination.redshift.credentials]
database = "please set me up!" # copy your database name here
password = "please set me up!" # keep your redshift db instance password here
username = "please set me up!" # keep your redshift db instance username here
host = "please set me up!" # copy your redshift host from cluster endpoint here
port = 5439
connect_timeout = 15 # enter the timeout value
```
The "host" is derived from the cluster endpoint specified in the “General Configuration.” For example:
# If the endpoint is:
# Then the host is:
`connect_timeout` is the number of seconds the pipeline will wait for a connection before timing out.
You can also pass a database connection string similar to the one used by the `psycopg2` library or SQLAlchemy. The credentials above will look like this:
```toml
# keep it at the top of your toml file! before any section starts
destination.redshift.credentials="redshift://username:password@host:5439/database?connect_timeout=15"
```
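The same connection string can also be passed in code instead of `secrets.toml`; a minimal sketch, assuming a hypothetical cluster endpoint, password, and database:
```py
import dlt

# A sketch: hand the connection string to the redshift destination factory
# (host, password, and database below are hypothetical placeholders).
pipeline = dlt.pipeline(
    pipeline_name="chess_pipeline",
    destination=dlt.destinations.redshift(
        "redshift://admin:ChangeMe123@my-cluster.abc123xyz789.us-east-1.redshift.amazonaws.com:5439/dlt_data?connect_timeout=15"
    ),
    dataset_name="chess_data",
)
```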
Write disposition
All write dispositions are supported.
Supported file formats
`insert_values` is used by default.
When staging is enabled:
- `jsonl` is supported
- `parquet` is supported
❗ Redshift cannot load `VARBYTE` columns from `json` files. `dlt` will fail such jobs permanently. Switch to `parquet` to load binaries.
❗ Redshift cannot load `TIME` columns from `json` or `parquet` files. `dlt` will fail such jobs permanently. Switch to direct `insert_values` to load time columns.
❗ Redshift cannot detect the compression type from `json` files, so `dlt` assumes that `jsonl` files are gzip compressed, which is the default.
❗ Redshift loads `complex` types as strings into SUPER with the `parquet` format. Use the `jsonl` format to store JSON in SUPER natively, or transform your SUPER columns with `PARSE_JSON`.
Supported column hints
Amazon Redshift supports the following column hints (a usage sketch follows the list):
- `cluster` - this hint is a Redshift term for table distribution. Applying it to a column makes it the "DISTKEY," affecting query and join performance. Check the following documentation for more info.
- `sort` - creates a SORTKEY to order rows on disk physically. It is used to improve query and join speed in Redshift; please read the sort key docs to learn more.
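For example, a minimal sketch of attaching both hints through the `columns` argument of a resource (the resource and column names are hypothetical):
```py
import dlt

# A sketch: the cluster hint makes player_id the DISTKEY and the
# sort hint makes played_at the SORTKEY of the resulting table.
@dlt.resource(
    columns={
        "player_id": {"data_type": "bigint", "cluster": True},
        "played_at": {"data_type": "timestamp", "sort": True},
    }
)
def games():
    yield {"player_id": 1, "played_at": "2023-01-01T00:00:00Z"}
```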
Staging support
Redshift supports s3 as a file staging destination. dlt will upload files in the parquet format to s3 and ask Redshift to copy their data directly into the db. Please refer to the S3 documentation to learn how to set up your s3 bucket with the bucket_url and credentials. The dlt Redshift loader will use the AWS credentials provided for s3 to access the s3 bucket if not specified otherwise (see config options below). As an alternative to parquet files, you can also specify jsonl as the staging file format: set the `loader_file_format` argument of the `run` command of the pipeline to `jsonl` (see the usage sketch after the example code below).
Authentication via IAM role
If you would like to load from s3 without forwarding the AWS staging credentials, but instead authorize with an IAM role connected to Redshift, follow the Redshift documentation to create a role with access to s3 linked to your Redshift cluster, and change your destination settings to use the IAM role:
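For example, in `.dlt/secrets.toml` (a sketch using the `staging_iam_role` option; the role ARN is a hypothetical placeholder):
```toml
[destination]
staging_iam_role = "arn:aws:iam::123456789012:role/redshift-s3-access"
```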
Redshift/S3 staging example code
```py
import dlt
# Create a dlt pipeline that will load
# chess player data to the redshift destination
# via staging on s3
pipeline = dlt.pipeline(
    pipeline_name='chess_pipeline',
    destination='redshift',
    staging='filesystem', # add this to activate the staging location
    dataset_name='player_data'
)
```
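A usage sketch for the pipeline above; the inline rows and table name are hypothetical stand-ins for the chess source, and `loader_file_format` switches the staging format from the `parquet` default to `jsonl`:
```py
# A sketch: run the pipeline defined above with jsonl staging files.
load_info = pipeline.run(
    [{"player_id": 1, "username": "magnus"}],  # hypothetical stand-in data
    table_name="players",
    loader_file_format="jsonl",  # optional; omit to stage parquet files
)
print(load_info)
```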
Additional destination options
- This destination integrates with dbt via dbt-redshift. Credentials and timeout settings are shared automatically with dbt; see the sketch after this list.
- This destination fully supports dlt state sync.
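As an illustration of the dbt integration, dlt's dbt runner can reuse the pipeline's Redshift credentials; a sketch, assuming a hypothetical dbt package path:
```py
import dlt

# A sketch: run a dbt package with the same credentials the pipeline uses
# ("path/to/dbt_package" is a hypothetical placeholder).
pipeline = dlt.pipeline(
    pipeline_name="chess_pipeline", destination="redshift", dataset_name="player_data"
)
dbt = dlt.dbt.package(pipeline, "path/to/dbt_package")
models = dbt.run_all()
for m in models:
    print(f"{m.model_name}: {m.status}")
```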