ClickHouse

Install dlt with ClickHouse

To install the dlt library with ClickHouse dependencies:

pip install "dlt[clickhouse]"

Setup Guide

1. Initialize the dlt project

Let's start by initializing a new dlt project as follows:

dlt init chess clickhouse

💡 This command will initialize your pipeline with chess as the source and ClickHouse as the destination.

The above command generates several files and directories, including .dlt/secrets.toml and a requirements file for ClickHouse. You can install the dependencies listed in the requirements file as follows:

pip install -r requirements.txt

or with pip install "dlt[clickhouse]", which installs the dlt library and the necessary dependencies for working with ClickHouse as a destination.

2. Setup ClickHouse database

To load data into ClickHouse, you need to create a ClickHouse database. While we recommend asking our GPT-4 assistant for details, we have provided a general outline of the process below:

  1. You can use an existing ClickHouse database or create a new one.

  2. To create a new database, connect to your ClickHouse server using the clickhouse-client command line tool or a SQL client of your choice.

  3. Run the following SQL commands to create a new database and user, and to grant the necessary permissions:

    CREATE DATABASE IF NOT EXISTS dlt;
    CREATE USER dlt IDENTIFIED WITH sha256_password BY 'Dlt*12345789234567';
    GRANT CREATE, ALTER, SELECT, DELETE, DROP, TRUNCATE, OPTIMIZE, SHOW, INSERT, dictGet ON dlt.* TO dlt;
    GRANT SELECT ON INFORMATION_SCHEMA.COLUMNS TO dlt;
    GRANT CREATE TEMPORARY TABLE, S3 ON *.* TO dlt;

3. Add credentials

  1. Next, set up the ClickHouse credentials in the .dlt/secrets.toml file as shown below:

    [destination.clickhouse.credentials]
    database = "dlt" # The database name you created
    username = "dlt" # ClickHouse username, default is usually "default"
    password = "Dlt*12345789234567" # ClickHouse password if any
    host = "localhost" # ClickHouse server host
    port = 9000 # ClickHouse native TCP port, default is 9000
    http_port = 8443 # HTTP Port to connect to ClickHouse server's HTTP interface. Defaults to 8443.
    secure = 1 # Set to 1 if using HTTPS, else 0.
    dataset_table_separator = "___" # Separator for dataset table names from dataset.

    http_port:

    The http_port parameter specifies the port number to use when connecting to the ClickHouse server's HTTP interface. This is different from the default port 9000, which is used for the native TCP protocol.

    You must set http_port if you are not using external staging (i.e. you don't set the staging parameter in your pipeline). This is because dlt's built-in ClickHouse local storage staging uses the clickhouse-connect library, which communicates with ClickHouse over HTTP.

    Make sure your ClickHouse server is configured to accept HTTP connections on the port specified by http_port. For example, if you set http_port = 8443, then ClickHouse should be listening for HTTP requests on port 8443. If you are using external staging, you can omit the http_port parameter, since clickhouse-connect will not be used in this case.

  2. You can pass a database connection string similar to the one used by the clickhouse-driver library. The credentials above will look like this:

    # keep it at the top of your toml file, before any section starts.
    destination.clickhouse.credentials="clickhouse://dlt:Dlt*12345789234567@localhost:9000/dlt?secure=1"
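To see how the connection string corresponds to the individual credential fields, here is a minimal sketch using only Python's standard library. The helper name parse_clickhouse_dsn is hypothetical, and the fallback of secure to 0 when the query parameter is absent is an assumption; dlt has its own parsing logic, which may differ.

```python
from urllib.parse import parse_qs, urlsplit


def parse_clickhouse_dsn(dsn: str) -> dict:
    """Split a clickhouse:// connection string into credential fields.

    Hypothetical helper for illustration only; not part of dlt.
    """
    parts = urlsplit(dsn)
    if parts.scheme != "clickhouse":
        raise ValueError(f"unexpected scheme: {parts.scheme!r}")
    query = parse_qs(parts.query)
    return {
        "database": parts.path.lstrip("/"),
        "username": parts.username,
        "password": parts.password,
        "host": parts.hostname,
        "port": parts.port,
        # Query values arrive as lists of strings; take the first one.
        # Falling back to 0 when "secure" is missing is an assumption.
        "secure": int(query.get("secure", ["0"])[0]),
    }


creds = parse_clickhouse_dsn(
    "clickhouse://dlt:Dlt*12345789234567@localhost:9000/dlt?secure=1"
)
print(creds)
```

Note that the port in the connection string is the native TCP port (9000 here), so http_port still has to be configured separately when it is needed.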

Write disposition

All write dispositions are supported.

Data loading

Data is loaded into ClickHouse using the most efficient method depending on the data source:

  • For local files, the clickhouse-connect library is used to directly load files into ClickHouse tables using the INSERT command.
  • For files in remote storage like S3, Google Cloud Storage, or Azure Blob Storage, ClickHouse table functions like s3, gcs and azureBlobStorage are used to read the files and insert the data into tables.
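The choice between the two loading paths can be pictured with a small sketch. Only the table function names (s3, gcs, azureBlobStorage) come from the text above; the function name choose_load_method and the mapping from URL scheme to table function are assumptions made for illustration, and dlt's actual dispatch logic may differ.

```python
from urllib.parse import urlsplit

# Assumed mapping from URL scheme to ClickHouse table function.
# The scheme names are illustrative and not taken from dlt's source.
TABLE_FUNCTIONS = {
    "s3": "s3",
    "gs": "gcs",
    "az": "azureBlobStorage",
}


def choose_load_method(file_url: str) -> str:
    """Return a label describing how a file at file_url would be loaded."""
    scheme = urlsplit(file_url).scheme
    if scheme in ("", "file"):
        # Local files are inserted directly through clickhouse-connect.
        return "clickhouse-connect INSERT"
    try:
        return f"table function {TABLE_FUNCTIONS[scheme]}"
    except KeyError:
        raise ValueError(f"unsupported scheme: {scheme!r}") from None


print(choose_load_method("/tmp/players.jsonl"))
print(choose_load_method("s3://my_awesome_bucket/players.parquet"))
```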

Datasets

ClickHouse does not support multiple datasets in one database, but dlt relies on datasets to exist for multiple reasons. To make ClickHouse work with dlt, tables generated by dlt in your ClickHouse database will have their names prefixed with the dataset name, separated by the configurable dataset_table_separator. Additionally, a special sentinel table that does not contain any data will be created, so dlt knows which virtual datasets already exist in a ClickHouse destination.
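As a rough illustration of the naming scheme, the sketch below builds a prefixed table name and filters a list of table names down to one virtual dataset. Both helper names are hypothetical, and the sketch assumes the dataset name, the separator and the table name are simply concatenated; dlt may apply further normalization.

```python
from __future__ import annotations


def qualified_table_name(dataset: str, table: str, separator: str = "___") -> str:
    """Prefix a table name with its dataset name, joined by the separator."""
    return f"{dataset}{separator}{table}"


def tables_in_dataset(
    all_tables: list[str], dataset: str, separator: str = "___"
) -> list[str]:
    """Select the tables that belong to one virtual dataset by their prefix."""
    prefix = f"{dataset}{separator}"
    return [name for name in all_tables if name.startswith(prefix)]


print(qualified_table_name("chess_data", "players"))
```

With the default separator, a players table in the chess_data dataset would be stored as chess_data___players.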

Supported file formats

  • jsonl is the preferred format for both direct loading and staging.
  • parquet is supported for both direct loading and staging.

The ClickHouse destination has a few specific deviations from the default SQL destinations:

  1. ClickHouse has an experimental object datatype, but we have found it to be a bit unpredictable, so the dlt ClickHouse destination will load the complex datatype to a text column. If you need this feature, get in touch with our Slack community, and we will consider adding it.
  2. ClickHouse does not support the time datatype. Time will be loaded to a text column.
  3. ClickHouse does not support the binary datatype. Binary will be loaded to a text column. When loading from jsonl, this will be a base64 string; when loading from parquet, this will be the binary object converted to text.
  4. ClickHouse accepts adding NOT NULL columns to a populated table.
  5. ClickHouse can produce rounding errors under certain conditions when using the float / double datatype. Make sure to use decimal if you cannot afford to have rounding errors. For example, loading the value 12.7001 to a double column with the loader file format set to jsonl will predictably produce a rounding error.
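The float and binary points can be illustrated in plain Python. This sketch shows the general behaviour of binary floating point and of base64 encoding; it does not reproduce what ClickHouse or dlt do internally, and the sample byte string is made up.

```python
import base64
from decimal import Decimal

# A binary float cannot store 12.7001 exactly. Decimal(float) exposes the
# value that is actually stored, which differs from the decimal literal.
as_float = Decimal(12.7001)
as_decimal = Decimal("12.7001")
print(as_float == as_decimal)  # False

# Binary values in jsonl files are carried as base64 text and can be
# decoded back to the original bytes after reading the text column.
raw = b"\x00\x01dlt"
encoded = base64.b64encode(raw).decode("ascii")
print(base64.b64decode(encoded) == raw)  # True
```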

Supported column hints

ClickHouse supports the following column hints:

  • primary_key - marks the column as part of the primary key. Multiple columns can have this hint to create a composite primary key.

Table Engine

By default, tables are created using the ReplicatedMergeTree table engine in ClickHouse. You can specify an alternate table engine using the table_engine_type parameter of the clickhouse adapter:

import dlt
from dlt.destinations.adapters import clickhouse_adapter


@dlt.resource()
def my_resource():
    ...


clickhouse_adapter(my_resource, table_engine_type="merge_tree")

Supported values are:

  • merge_tree - creates tables using the MergeTree engine
  • replicated_merge_tree (default) - creates tables using the ReplicatedMergeTree engine
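If the engine type comes from user input or configuration, it may be worth validating it before calling the adapter. The sketch below is a hypothetical helper, not part of dlt; the two supported values and the default are taken from the list above.

```python
# Values taken from the list above; the helper itself is hypothetical.
TABLE_ENGINES = {
    "merge_tree": "MergeTree",
    "replicated_merge_tree": "ReplicatedMergeTree",
}


def resolve_table_engine(table_engine_type: str = "replicated_merge_tree") -> str:
    """Map a table_engine_type value to the ClickHouse engine name."""
    try:
        return TABLE_ENGINES[table_engine_type]
    except KeyError:
        supported = ", ".join(sorted(TABLE_ENGINES))
        raise ValueError(
            f"unsupported table_engine_type {table_engine_type!r}; "
            f"expected one of: {supported}"
        ) from None


print(resolve_table_engine("merge_tree"))
```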

Staging support

ClickHouse supports Amazon S3, Google Cloud Storage and Azure Blob Storage as file staging destinations.

dlt will upload Parquet or JSONL files to the staging location and use ClickHouse table functions to load the data directly from the staged files.

Please refer to the filesystem documentation to learn how to configure credentials for the staging destinations.

To run a pipeline with staging enabled:

import dlt

pipeline = dlt.pipeline(
    pipeline_name='chess_pipeline',
    destination='clickhouse',
    staging='filesystem',  # add this to activate staging
    dataset_name='chess_data'
)

Using Google Cloud or S3-Compatible Storage as a Staging Area

dlt supports using S3-compatible storage services, including Google Cloud Storage (GCS), as a staging area when loading data into ClickHouse. This is handled automatically by ClickHouse's GCS table function, which dlt uses under the hood.

The ClickHouse GCS table function only supports authentication using Hash-based Message Authentication Code (HMAC) keys, which is compatible with the Amazon S3 API. To enable this, GCS provides an S3 compatibility mode that emulates the S3 API, allowing ClickHouse to access GCS buckets via its S3 integration.

For detailed instructions on setting up S3-compatible storage with dlt, including AWS S3, MinIO, and Cloudflare R2, refer to the dlt documentation on filesystem destinations.

To set up GCS staging with HMAC authentication in dlt:

  1. Create HMAC keys for your GCS service account by following the Google Cloud guide.

  2. Configure the HMAC keys (aws_access_key_id and aws_secret_access_key) in your dlt project's ClickHouse destination settings in config.toml, similar to how you would configure AWS S3 credentials:

[destination.filesystem]
bucket_url = "s3://my_awesome_bucket"

[destination.filesystem.credentials]
aws_access_key_id = "JFJ$$*f2058024835jFffsadf"
aws_secret_access_key = "DFJdwslf2hf57)%$02jaflsedjfasoi"
project_id = "my-awesome-project"
endpoint_url = "https://storage.googleapis.com"

Caution

When configuring the bucket_url for S3-compatible storage services like Google Cloud Storage (GCS) with ClickHouse in dlt, ensure that the URL is prefixed with s3:// instead of gs://. This is because the ClickHouse GCS table function requires the use of HMAC credentials, which are compatible with the S3 API. Prefixing with s3:// allows the HMAC credentials to integrate properly with dlt's staging mechanisms for ClickHouse.
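If existing configuration already holds gs:// URLs, a small helper can rewrite them to the s3:// form. This is a minimal sketch; the function name to_s3_bucket_url is hypothetical and not part of dlt.

```python
def to_s3_bucket_url(bucket_url: str) -> str:
    """Rewrite a gs:// bucket URL to s3://; leave s3:// URLs unchanged."""
    if bucket_url.startswith("s3://"):
        return bucket_url
    if bucket_url.startswith("gs://"):
        # Swap only the scheme; the bucket name and path stay the same.
        return "s3://" + bucket_url[len("gs://"):]
    raise ValueError(f"expected a gs:// or s3:// URL, got {bucket_url!r}")


print(to_s3_bucket_url("gs://my_awesome_bucket"))
```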

dbt support

Integration with dbt is generally supported via dbt-clickhouse, but not tested by us.

Syncing of dlt state

This destination fully supports dlt state sync.
