# ClickHouse
## Install dlt with ClickHouse

To install the dlt library with ClickHouse dependencies:

```sh
pip install "dlt[clickhouse]"
```
## Setup Guide

### 1. Initialize the dlt project

Let's start by initializing a new dlt project as follows:

```sh
dlt init chess clickhouse
```
💡 This command will initialize your pipeline with chess as the source and ClickHouse as the destination.
The above command generates several files and directories, including `.dlt/secrets.toml` and a requirements file for ClickHouse. You can install the dependencies specified in the requirements file with:

```sh
pip install -r requirements.txt
```

or with `pip install "dlt[clickhouse]"`, which installs the dlt library and the necessary dependencies for working with ClickHouse as a destination.
### 2. Set up the ClickHouse database

To load data into ClickHouse, you need to create a ClickHouse database. While we recommend asking our GPT-4 assistant for details, we have provided a general outline of the process below:

1. You can use an existing ClickHouse database or create a new one.
2. To create a new database, connect to your ClickHouse server using the `clickhouse-client` command line tool or a SQL client of your choice.
3. Run the following SQL commands to create a new database and user, and grant the necessary permissions:
```sql
CREATE DATABASE IF NOT EXISTS dlt;
CREATE USER dlt IDENTIFIED WITH sha256_password BY 'Dlt*12345789234567';
GRANT CREATE, ALTER, SELECT, DELETE, DROP, TRUNCATE, OPTIMIZE, SHOW, INSERT, dictGet ON dlt.* TO dlt;
GRANT SELECT ON INFORMATION_SCHEMA.COLUMNS TO dlt;
GRANT CREATE TEMPORARY TABLE, S3 ON *.* TO dlt;
```
### 3. Add credentials

Next, set up the ClickHouse credentials in the `.dlt/secrets.toml` file as shown below:

```toml
[destination.clickhouse.credentials]
database = "dlt"                 # The database name you created
username = "dlt"                 # ClickHouse username, default is usually "default"
password = "Dlt*12345789234567"  # ClickHouse password, if any
host = "localhost"               # ClickHouse server host
port = 9000                      # Native TCP protocol port, default is 9000
http_port = 8443                 # Port for the ClickHouse server's HTTP interface. Defaults to 8443.
secure = 1                       # Set to 1 if using HTTPS, else 0.
dataset_table_separator = "___"  # Separator for dataset table names from the dataset.
```
The `http_port` parameter specifies the port number to use when connecting to the ClickHouse server's HTTP interface. This is different from the default port 9000, which is used for the native TCP protocol.

You must set `http_port` if you are not using external staging (i.e., you don't set the `staging` parameter in your pipeline). This is because dlt's built-in ClickHouse local storage staging uses the clickhouse-connect library, which communicates with ClickHouse over HTTP.

Make sure your ClickHouse server is configured to accept HTTP connections on the port specified by `http_port`. For example, if you set `http_port = 8443`, then ClickHouse should be listening for HTTP requests on port 8443. If you are using external staging, you can omit the `http_port` parameter, since clickhouse-connect will not be used in this case.

You can pass a database connection string similar to the one used by the clickhouse-driver library. The credentials above will look like this:

```toml
# keep it at the top of your toml file, before any section starts.
destination.clickhouse.credentials="clickhouse://dlt:Dlt*12345789234567@localhost:9000/dlt?secure=1"
```
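If you prefer to configure the destination in code rather than in toml, the same connection string can also be handed to a destination factory. A minimal sketch, assuming the `dlt.destinations.clickhouse` factory accepts a connection string as its credentials (pipeline and dataset names here are illustrative):

```python
import dlt

# Assumption: the clickhouse destination factory accepts a connection
# string as credentials, mirroring the toml value above.
dest = dlt.destinations.clickhouse(
    "clickhouse://dlt:Dlt*12345789234567@localhost:9000/dlt?secure=1"
)

pipeline = dlt.pipeline(
    pipeline_name="chess_pipeline",
    destination=dest,
    dataset_name="chess_data",
)
```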
## Write disposition

All write dispositions are supported.
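For example, the write disposition is chosen per run (or per resource); a minimal sketch with an illustrative resource:

```python
import dlt

@dlt.resource(primary_key="id")  # a primary key is required for "merge"
def players():
    yield [{"id": 1, "name": "magnus"}, {"id": 2, "name": "hikaru"}]

pipeline = dlt.pipeline(
    pipeline_name="chess_pipeline",
    destination="clickhouse",
    dataset_name="chess_data",
)

# "append", "replace", and "merge" all work against ClickHouse.
pipeline.run(players, write_disposition="merge")
```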
## Data loading

Data is loaded into ClickHouse using the most efficient method depending on the data source:

- For local files, the `clickhouse-connect` library is used to directly load files into ClickHouse tables using the `INSERT` command.
- For files in remote storage like S3, Google Cloud Storage, or Azure Blob Storage, ClickHouse table functions like `s3`, `gcs`, and `azureBlobStorage` are used to read the files and insert the data into tables.
## Datasets

ClickHouse does not support multiple datasets in one database, while dlt relies on datasets for multiple reasons. To make ClickHouse work with dlt, tables generated by dlt in your ClickHouse database will have their names prefixed with the dataset name, separated by the configurable `dataset_table_separator`. Additionally, a special sentinel table that does not contain any data will be created, so dlt knows which virtual datasets already exist in a ClickHouse destination.
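For illustration, a sketch of how the prefixing plays out, assuming a dataset named `chess_data`, the `___` separator from the configuration above, and a hypothetical resource named `players`:

```python
import dlt

@dlt.resource()
def players():
    yield {"id": 1, "name": "magnus"}

pipeline = dlt.pipeline(
    pipeline_name="chess_pipeline",
    destination="clickhouse",
    dataset_name="chess_data",  # becomes the table name prefix
)

# With dataset_table_separator = "___", this resource lands in the
# ClickHouse table `chess_data___players` inside the configured database.
pipeline.run(players)
```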
## Supported file formats

- `jsonl` is the preferred format for both direct loading and staging.
- `parquet` is supported for both direct loading and staging.
The `clickhouse` destination has a few specific deviations from the default SQL destinations:

- ClickHouse has an experimental `object` datatype, but we have found it to be a bit unpredictable, so the dlt ClickHouse destination will load the complex datatype to a `text` column. If you need this feature, get in touch with our Slack community, and we will consider adding it.
- ClickHouse does not support the `time` datatype. Time will be loaded to a `text` column.
- ClickHouse does not support the `binary` datatype. Binary will be loaded to a `text` column. When loading from `jsonl`, this will be a base64 string; when loading from `parquet`, this will be the `binary` object converted to `text`.
- ClickHouse accepts adding columns that are not null to a populated table.
- ClickHouse can produce rounding errors under certain conditions when using the float/double datatype. Make sure to use the decimal datatype if you cannot afford rounding errors. For example, loading the value 12.7001 into a double column with the loader file format set to `jsonl` will predictably produce a rounding error.
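If rounding errors are a concern, a column can be pinned to the decimal type through dlt's column hints. A minimal sketch, using a hypothetical `price` field:

```python
import dlt

# Hint the `price` column as decimal so a value like 12.7001 is not
# loaded into a double column, where jsonl loading can round it.
@dlt.resource(columns={"price": {"data_type": "decimal"}})
def orders():
    yield {"order_id": 1, "price": 12.7001}
```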
## Supported column hints

ClickHouse supports the following column hints:

- `primary_key` - marks the column as part of the primary key. Multiple columns can have this hint to create a composite primary key.
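For example, a composite primary key can be declared directly on a resource (field names here are illustrative):

```python
import dlt

# Both columns carry the primary_key hint, yielding a composite primary key.
@dlt.resource(primary_key=["game_id", "player_id"])
def game_results():
    yield {"game_id": 1, "player_id": 7, "result": "win"}
```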
## Table Engine

By default, tables are created using the `ReplicatedMergeTree` table engine in ClickHouse. You can specify an alternate table engine using the `table_engine_type` argument with the clickhouse adapter:

```python
import dlt
from dlt.destinations.adapters import clickhouse_adapter

@dlt.resource()
def my_resource():
    ...

clickhouse_adapter(my_resource, table_engine_type="merge_tree")
```
Supported values are:

- `merge_tree` - creates tables using the `MergeTree` engine
- `replicated_merge_tree` (default) - creates tables using the `ReplicatedMergeTree` engine
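The adapter returns the modified resource (as dlt adapters generally do), so it can be passed straight to a pipeline run; a brief usage sketch with illustrative names:

```python
import dlt
from dlt.destinations.adapters import clickhouse_adapter

@dlt.resource()
def my_resource():
    yield {"id": 1}

pipeline = dlt.pipeline(
    pipeline_name="my_pipeline",
    destination="clickhouse",
    dataset_name="my_data",
)

# clickhouse_adapter applies the table engine hint and returns the resource.
pipeline.run(clickhouse_adapter(my_resource, table_engine_type="merge_tree"))
```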
## Staging support

ClickHouse supports Amazon S3, Google Cloud Storage, and Azure Blob Storage as file staging destinations. dlt will upload Parquet or JSONL files to the staging location and use ClickHouse table functions to load the data directly from the staged files.

Please refer to the filesystem documentation to learn how to configure credentials for the staging destinations.
To run a pipeline with staging enabled:

```python
import dlt

pipeline = dlt.pipeline(
    pipeline_name='chess_pipeline',
    destination='clickhouse',
    staging='filesystem',  # add this to activate staging
    dataset_name='chess_data'
)
```
### Using Google Cloud Storage as a Staging Area
dlt supports using Google Cloud Storage (GCS) as a staging area when loading data into ClickHouse. This is handled automatically by ClickHouse's GCS table function, which dlt uses under the hood.

The ClickHouse GCS table function only supports authentication using Hash-based Message Authentication Code (HMAC) keys. To enable this, GCS provides an S3 compatibility mode that emulates the Amazon S3 API, and ClickHouse takes advantage of it to access GCS buckets via its S3 integration.
To set up GCS staging with HMAC authentication in dlt:

1. Create HMAC keys for your GCS service account by following the Google Cloud guide.
2. Configure the HMAC keys as well as the `client_email`, `project_id`, and `private_key` for your service account in your dlt project's ClickHouse destination settings in `config.toml`:
```toml
[destination.filesystem]
bucket_url = "gs://dlt-ci"

[destination.filesystem.credentials]
project_id = "a-cool-project"
client_email = "my-service-account@a-cool-project.iam.gserviceaccount.com"
private_key = "-----BEGIN PRIVATE KEY-----\nMIIEvQIBADANBgkaslkdjflasjnkdcopauihj...wEiEx7y+mx\nNffxQBqVVej2n/D93xY99pM=\n-----END PRIVATE KEY-----\n"

[destination.clickhouse.credentials]
database = "dlt"
username = "dlt"
password = "Dlt*12345789234567"
host = "localhost"
port = 9440
secure = 1
gcp_access_key_id = "JFJ$$*f2058024835jFffsadf"
gcp_secret_access_key = "DFJdwslf2hf57)%$02jaflsedjfasoi"
```
Note: In addition to the HMAC keys (`gcp_access_key_id` and `gcp_secret_access_key`), you now need to provide the `client_email`, `project_id`, and `private_key` for your service account under `[destination.filesystem.credentials]`. This is because the GCS staging support is currently implemented as a temporary workaround and is still unoptimized.

dlt will pass these credentials to ClickHouse, which will handle the authentication and GCS access.

There is active work in progress to simplify and improve the GCS staging setup for the ClickHouse dlt destination. Proper GCS staging support is being tracked in GitHub issues on the dlt repository.
## dbt support
Integration with dbt is generally supported via dbt-clickhouse, but not tested by us.
## Syncing of dlt state
This destination fully supports dlt state sync.
## Additional setup guides
- Load data from Stripe to ClickHouse in python with dlt
- Load data from Slack to ClickHouse in python with dlt
- Load data from Google Sheets to ClickHouse in python with dlt
- Load data from PostgreSQL to ClickHouse in python with dlt
- Load data from SAP HANA to ClickHouse in python with dlt
- Load data from The Local Filesystem to ClickHouse in python with dlt
- Load data from Mux to ClickHouse in python with dlt
- Load data from AWS S3 to ClickHouse in python with dlt
- Load data from Salesforce to ClickHouse in python with dlt
- Load data from Shopify to ClickHouse in python with dlt
- Load data from Spotify to ClickHouse in python with dlt
- Load data from Chargebee to ClickHouse in python with dlt
- Load data from Rest API to ClickHouse in python with dlt
- Load data from Google Cloud Storage to ClickHouse in python with dlt
- Load data from MySQL to ClickHouse in python with dlt
- Load data from GitHub to ClickHouse in python with dlt
- Load data from HubSpot to ClickHouse in python with dlt
- Load data from Box Platform API to ClickHouse in python with dlt
- Load data from Chess.com to ClickHouse in python with dlt
- Load data from Pipedrive to ClickHouse in python with dlt
- Load data from Apple App-Store Connect to ClickHouse in python with dlt
- Load data from Azure Cloud Storage to ClickHouse in python with dlt
- Load data from Zendesk to ClickHouse in python with dlt
- Load data from Sentry to ClickHouse in python with dlt
- Load data from Google Analytics to ClickHouse in python with dlt
- Load data from Microsoft SQL Server to ClickHouse in python with dlt
- Load data from Jira to ClickHouse in python with dlt
- Load data from Oracle Database to ClickHouse in python with dlt
- Load data from IBM Db2 to ClickHouse in python with dlt
- Load data from Airtable to ClickHouse in python with dlt
- Load data from Notion to ClickHouse in python with dlt
- Load data from MongoDB to ClickHouse in python with dlt