CSV file format
CSV is the most basic file format for storing tabular data, where all values are strings and are separated by a delimiter (typically a comma).
dlt uses it for specific use cases, mostly for performance and compatibility reasons.
Internally, we use two implementations:
- Python standard library CSV writer
- PyArrow CSV writer - a very fast, multithreaded writer for Arrow tables
Supported destinations
How to configure
There are several ways of configuring dlt to use the `csv` file format for the normalization step and to store your data at the destination:
- You can set the `loader_file_format` argument to `csv` in the run command:

      info = pipeline.run(some_source(), loader_file_format="csv")
- Alternatively, you can set the file type directly in the resource decorator:

      @dlt.resource(file_format="csv")
      def generate_rows(nr):
          pass
Default settings
dlt attempts to make both writers generate similar-looking files:
- separators are commas
- quotes are `"` and are escaped as `""`
- NULL values are represented as both empty strings and empty tokens, as in the example below
- UNIX newlines are used
- dates are represented as ISO 8601
- quoting style is "when needed"
Example of NULLs:

    text1,text2,text3
    A,B,C
    A,,""
In the last row, both `text2` and `text3` values are NULL. The Python `csv` writer is not able to write unquoted `None` values, so we had to settle for `""`.
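The `""` compromise comes straight from the standard library: Python's `csv` writer converts `None` to an empty string, so under the default (minimal) quoting a NULL and an empty string become indistinguishable empty tokens. A minimal sketch using only the stdlib (not dlt's actual writer):

```python
import csv
import io

# Python's csv writer converts None to an empty string; under the default
# QUOTE_MINIMAL quoting, both None and "" end up as bare empty tokens.
buf = io.StringIO()
csv.writer(buf).writerow(["A", None, ""])
print(repr(buf.getvalue()))  # 'A,,\r\n' -- NULL and empty string look the same
```

This is why dlt's output quotes the empty string (`""`) for `None`: it is the closest the stdlib writer can get to marking the value.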
Note: all destinations capable of writing CSVs must support these default settings.
Change settings
You can change basic `csv` settings; this may be handy when working with the `filesystem` destination. Other destinations are tested with standard settings:
- `delimiter`: change the delimiting character (default: `,`)
- `include_header`: include the header row (default: `True`)
- `quoting`: controls when quotes should be generated around field values. Available options:
  - `quote_needed` (default): quote only values that need quoting, i.e., non-numeric values
    - Python CSV writer: all non-numeric values are quoted
    - PyArrow CSV writer: the exact behavior is not fully documented; we observed that in some cases strings are not quoted either
  - `quote_all`: all values are quoted
    - Supported by both the Python CSV writer and the PyArrow CSV writer
  - `quote_minimal`: quote only fields containing special characters (delimiter, quote character, or line terminator)
    - Supported by the Python CSV writer only
  - `quote_none`: never quote fields
    - Python CSV writer: uses the escape character when the delimiter appears in the data
    - PyArrow CSV writer: raises an error if the data contains special characters
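For the Python writer path, these options correspond closely to the stdlib's quoting constants. A hedged comparison using only the stdlib (`QUOTE_NONNUMERIC` approximates `quote_needed`; this is not dlt's internal code):

```python
import csv
import io

def dump(quoting):
    """Write one row with the given quoting mode and return the raw CSV text."""
    buf = io.StringIO()
    # escapechar is required for QUOTE_NONE when the delimiter appears in data
    csv.writer(buf, quoting=quoting, escapechar="\\").writerow(["a,b", 7, "c"])
    return buf.getvalue()

print(dump(csv.QUOTE_MINIMAL))     # "a,b",7,c      -> quote_minimal
print(dump(csv.QUOTE_NONNUMERIC))  # "a,b",7,"c"    -> quote_needed (approx.)
print(dump(csv.QUOTE_ALL))         # "a,b","7","c"  -> quote_all
print(dump(csv.QUOTE_NONE))        # a\,b,7,c       -> quote_none
```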
For example, in your TOML configuration:

    [normalize.data_writer]
    delimiter="|"
    include_header=false
    quoting="quote_all"
Or using environment variables:
    NORMALIZE__DATA_WRITER__DELIMITER=|
    NORMALIZE__DATA_WRITER__INCLUDE_HEADER=False
    NORMALIZE__DATA_WRITER__QUOTING=quote_all
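To see what those three settings produce, here is a sketch that emulates them with the stdlib `csv` writer (an approximation of dlt's Python writer path, not dlt code):

```python
import csv
import io

# Emulating delimiter="|", include_header=false, quoting="quote_all"
# with the stdlib writer (QUOTE_ALL quotes every field, numbers included).
buf = io.StringIO()
writer = csv.writer(buf, delimiter="|", quoting=csv.QUOTE_ALL)
# include_header=false: the header row is simply omitted
writer.writerow([1, "Alice"])
writer.writerow([2, "Bob"])
print(buf.getvalue())
# "1"|"Alice"
# "2"|"Bob"
```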
Destination settings
A few additional settings are available when copying `csv` to destination tables:
- `on_error_continue`: skip lines with errors (Snowflake only)
- `encoding`: encoding of the `csv` file
You'll need these settings when importing external files.
Limitations
Arrow writer:
- binary columns are supported only if they contain valid UTF-8 characters
- json (nested, struct) types are not supported
CSV writer:
- binary columns are supported only if they contain valid UTF-8 characters (easy to add more encodings)
- json columns are dumped with `json.dumps`
- `None` values are always quoted
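For illustration, the `json.dumps` behavior means a nested value becomes a JSON string inside a single CSV cell. A hypothetical sketch of that flattening step (not dlt's internal code):

```python
import csv
import io
import json

row = {"id": 1, "payload": {"a": [1, 2]}}

# Nested (json/struct) values are serialized to JSON strings; the CSV writer
# then quotes the cell and doubles the embedded quotes.
flat = [json.dumps(v) if isinstance(v, (dict, list)) else v for v in row.values()]

buf = io.StringIO()
csv.writer(buf).writerow(flat)
print(buf.getvalue())  # 1,"{""a"": [1, 2]}"
```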