CSV file format
CSV is the most basic file format for storing tabular data, where all values are strings and are separated by a delimiter (typically a comma).
dlt uses it for specific use cases, mostly for performance and compatibility reasons.
Internally, we use two implementations:
- Python standard library CSV writer
- PyArrow CSV writer - a very fast, multithreaded writer for Arrow tables
Supported destinations
How to configure
There are several ways of configuring dlt to use the `csv` file format for the normalization step and to store your data at the destination:
- You can set the `loader_file_format` argument to `csv` in the run command:

      info = pipeline.run(some_source(), loader_file_format="csv")
- Alternatively, you can set the file type directly in the resource decorator:

      @dlt.resource(file_format="csv")
      def generate_rows(nr):
          pass
Default settings
dlt attempts to make both writers generate similar-looking files:
- separators are commas
- quotes are `"` and are escaped as `""`
- NULL values are represented as both empty strings and empty tokens, as in the example below
- UNIX newlines are used
- dates are represented as ISO 8601
- quoting style is "when needed"
Example of NULLs:

    text1,text2,text3
    A,B,C
    A,,""
In the last row, both `text2` and `text3` values are NULL. The Python `csv` writer is not able to write unquoted `None` values, so we had to settle for `""`.
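The `""` compromise comes straight from the standard library: Python's `csv` writer converts `None` to an empty string, so under the default (minimal) quoting a NULL and an empty string become indistinguishable empty tokens. A minimal sketch using only the stdlib (not dlt's actual writer):

```python
import csv
import io

# Python's csv writer converts None to an empty string; under the default
# QUOTE_MINIMAL quoting, both None and "" end up as bare empty tokens.
buf = io.StringIO()
csv.writer(buf).writerow(["A", None, ""])
print(repr(buf.getvalue()))  # 'A,,\r\n' -- NULL and empty string look the same
```

This is why dlt's output quotes the empty string (`""`) for `None`: it is the closest the stdlib writer can get to marking the value.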
Note: all destinations capable of writing CSVs must support these default settings.
Change settings
You can change basic `csv` settings; this may be handy when working with the `filesystem` destination. Other destinations are tested with standard settings:
- `delimiter`: change the delimiting character (default: `,`)
- `include_header`: include the header row (default: `True`)
- `quoting`: controls when quotes should be generated around field values. Available options:
  - `quote_needed` (default): quote only values that need quoting, i.e., non-numeric values
    - Python CSV writer: all non-numeric values are quoted
    - PyArrow CSV writer: the exact behavior is not fully documented; we observed that in some cases strings are not quoted either
  - `quote_all`: all values are quoted
    - Supported by both the Python CSV writer and the PyArrow CSV writer
  - `quote_minimal`: quote only fields containing special characters (delimiter, quote character, or line terminator)
    - Supported by the Python CSV writer only
  - `quote_none`: never quote fields
    - Python CSV writer: uses the escape character when the delimiter appears in the data
    - PyArrow CSV writer: raises an error if the data contains special characters
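For the Python writer path, these options correspond closely to the stdlib's quoting constants. A hedged comparison using only the stdlib (`QUOTE_NONNUMERIC` approximates `quote_needed`; this is not dlt's internal code):

```python
import csv
import io

def dump(quoting):
    """Write one row with the given quoting mode and return the raw CSV text."""
    buf = io.StringIO()
    # escapechar is required for QUOTE_NONE when the delimiter appears in data
    csv.writer(buf, quoting=quoting, escapechar="\\").writerow(["a,b", 7, "c"])
    return buf.getvalue()

print(dump(csv.QUOTE_MINIMAL))     # "a,b",7,c      -> quote_minimal
print(dump(csv.QUOTE_NONNUMERIC))  # "a,b",7,"c"    -> quote_needed (approx.)
print(dump(csv.QUOTE_ALL))         # "a,b","7","c"  -> quote_all
print(dump(csv.QUOTE_NONE))        # a\,b,7,c       -> quote_none
```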
For example, in your TOML configuration:

    [normalize.data_writer]
    delimiter="|"
    include_header=false
    quoting="quote_all"
Or using environment variables:
    NORMALIZE__DATA_WRITER__DELIMITER=|
    NORMALIZE__DATA_WRITER__INCLUDE_HEADER=False
    NORMALIZE__DATA_WRITER__QUOTING=quote_all
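To see what those three settings produce, here is a sketch that emulates them with the stdlib `csv` writer (an approximation of dlt's Python writer path, not dlt code):

```python
import csv
import io

# Emulating delimiter="|", include_header=false, quoting="quote_all"
# with the stdlib writer (QUOTE_ALL quotes every field, numbers included).
buf = io.StringIO()
writer = csv.writer(buf, delimiter="|", quoting=csv.QUOTE_ALL)
# include_header=false: the header row is simply omitted
writer.writerow([1, "Alice"])
writer.writerow([2, "Bob"])
print(buf.getvalue())
# "1"|"Alice"
# "2"|"Bob"
```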
Destination settings
A few additional settings are available when copying `csv` to destination tables:
- `on_error_continue`: skip lines with errors (Snowflake only)
- `encoding`: encoding of the `csv` file
You'll need these settings when importing external files.
Limitations
Arrow writer:
- binary columns are supported only if they contain valid UTF-8 characters
- json (nested, struct) types are not supported
CSV writer:
- binary columns are supported only if they contain valid UTF-8 characters (easy to add more encodings)
- json columns are dumped with `json.dumps`
- `None` values are always quoted
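For illustration, the `json.dumps` behavior means a nested value becomes a JSON string inside a single CSV cell. A hypothetical sketch of that flattening step (not dlt's internal code):

```python
import csv
import io
import json

row = {"id": 1, "payload": {"a": [1, 2]}}

# Nested (json/struct) values are serialized to JSON strings; the CSV writer
# then quotes the cell and doubles the embedded quotes.
flat = [json.dumps(v) if isinstance(v, (dict, list)) else v for v in row.values()]

buf = io.StringIO()
csv.writer(buf).writerow(flat)
print(buf.getvalue())  # 1,"{""a"": [1, 2]}"
```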