Dremio
Install dlt with Dremioโ
To install the dlt library with Dremio and s3 dependencies:
pip install "dlt[dremio,s3]"
Setup guideโ
1. Initialize the dlt projectโ
Let's start by initializing a new dlt project as follows:
dlt init chess dremio
๐ก This command will initialize your pipeline with chess as the source and aws dremio as the destination using the filesystem staging destination.
2. Setup bucket storage and Dremio credentialsโ
First, install dependencies by running:
pip install -r requirements.txt
or with pip install "dlt[dremio,s3]"
which will install s3fs
, pyarrow
, and botocore
packages.
To edit the dlt
credentials file with your secret info, open .dlt/secrets.toml
. You will need to provide a bucket_url
which holds the uploaded parquet files.
The TOML file looks like this:
[destination.filesystem]
bucket_url = "s3://[your_bucket_name]" # replace with your bucket name,
[destination.filesystem.credentials]
aws_access_key_id = "please set me up!" # copy the access key here
aws_secret_access_key = "please set me up!" # copy the secret access key here
[destination.dremio]
staging_data_source = "<staging-data-source>" # the name of the "Object Storage" data source in Dremio containing the s3 bucket
[destination.dremio.credentials]
username = "<username>" # the Dremio username
password = "<password or pat token>" # Dremio password or PAT token
database = "<database>" # the name of the "data source" set up in Dremio where you want to load your data
host = "localhost" # the Dremio hostname
port = 32010 # the Dremio Arrow Flight grpc port
drivername="grpc" # either 'grpc' or 'grpc+tls'
You can also pass a SqlAlchemy-like connection like below:
[destination.dremio]
staging_data_source="s3_staging"
credentials="grpc://<username>:<password>@<host>:<port>/<data_source>"
If you have your credentials stored in ~/.aws/credentials
, just remove the [destination.filesystem.credentials] and [destination.dremio.credentials] sections above and dlt
will fall back to your default profile in local credentials. If you want to switch the profile, pass the profile name as follows (here: dlt-ci-user
):
[destination.filesystem.credentials]
profile_name="dlt-ci-user"
Write dispositionโ
dremio
destination handles the write dispositions as follows:
append
replace
merge
The merge
write disposition uses the default DELETE/UPDATE/INSERT strategy to merge data into the destination. Be aware that Dremio does not support transactions, so a partial pipeline failure can result in the destination table being in an inconsistent state. The merge
write disposition will eventually be implemented using MERGE INTO to resolve this issue.
Data loadingโ
Data loading happens by copying staged parquet files from an object storage bucket to the destination table in Dremio using COPY INTO statements. The destination table format is specified by the storage format for the data source in Dremio. Typically, this will be Apache Iceberg.
Dremio cannot load fixed_len_byte_array
columns from Parquet files.
Dataset creationโ
Dremio does not support CREATE SCHEMA
DDL statements.
Therefore, "Metastore" data sources, such as Hive or Glue, require that the dataset schema exists prior to running the dlt pipeline. dev_mode=True
is unsupported for these data sources.
"Object Storage" data sources do not have this limitation.
Staging supportโ
Using a staging destination is mandatory when using the Dremio destination. If you do not set staging to filesystem
, dlt will automatically do this for you.
Table partitioning and local sortโ
Apache Iceberg table partitions and local sort properties can be configured as shown below:
import dlt
from dlt.common.schema import TColumnSchema
@dlt.resource(
table_name="my_table",
columns=dict(
foo=TColumnSchema(partition=True),
bar=TColumnSchema(partition=True),
baz=TColumnSchema(sort=True),
),
)
def my_table_resource():
...
This will result in PARTITION BY ("foo","bar")
and LOCALSORT BY ("baz")
clauses being added to the CREATE TABLE
DML statement.
Note: Table partition migration is not implemented. The table will need to be dropped and recreated to alter partitions or localsort.
Syncing of dlt
stateโ
- This destination fully supports dlt state sync.
Additional Setup guidesโ
- Load data from Harvest to Dremio in python with dlt
- Load data from Stripe to Dremio in python with dlt
- Load data from CircleCI to Dremio in python with dlt
- Load data from Bitbucket to Dremio in python with dlt
- Load data from Rest API to Dremio in python with dlt
- Load data from Azure Cloud Storage to Dremio in python with dlt
- Load data from Google Cloud Storage to Dremio in python with dlt
- Load data from Box Platform API to Dremio in python with dlt
- Load data from Chargebee to Dremio in python with dlt
- Load data from Klaviyo to Dremio in python with dlt