CSV file format

CSV is the most basic file format for storing tabular data: all values are strings separated by a delimiter (typically a comma). dlt uses it for specific use cases, mostly for performance and compatibility reasons.

Internally we use two implementations:

  • pyarrow csv writer - very fast, multithreaded writer for the arrow tables
  • python stdlib writer - a csv writer included in the Python standard library for Python objects

Supported Destinations

Supported by: Postgres, Filesystem

Setting the loader_file_format argument to csv in the run command makes the pipeline store your data in the csv format at the destination:

info = pipeline.run(some_source(), loader_file_format="csv")

Default Settings

dlt attempts to make both writers generate similar-looking files:

  • separators are commas
  • quotes are " and are escaped as ""
  • NULL values are empty strings
  • UNIX new lines are used
  • dates are represented as ISO 8601
  • quoting style is "when needed"
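
As a rough illustration of these conventions, the sketch below configures the Python standard library writer by hand. It is not dlt's internal code: the helper name, the sample rows, and the choice of csv.QUOTE_NONNUMERIC to stand in for "when needed" quoting are assumptions made for the example.

```python
import csv
import io
from datetime import date


def write_rows(rows):
    """Write rows with comma separators, "" escaping and UNIX new lines."""
    buffer = io.StringIO()
    writer = csv.writer(
        buffer,
        delimiter=",",
        quotechar='"',
        doublequote=True,              # a quote inside a value is written as ""
        lineterminator="\n",           # UNIX new lines
        quoting=csv.QUOTE_NONNUMERIC,  # assumption: stands in for "when needed"
    )
    for row in rows:
        # dates are converted to ISO 8601 strings before writing
        writer.writerow([v.isoformat() if isinstance(v, date) else v for v in row])
    return buffer.getvalue()


sample = write_rows([["id", "note", "day"], [1, 'say "hi"', date(2024, 1, 2)]])
print(sample, end="")
```

The embedded quote in the second row comes out doubled, and the date is written as 2024-01-02.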

Change settings

You can change the following basic csv settings, which may be handy when working with the filesystem destination (other destinations are tested with the standard settings):

  • delimiter: change the delimiting character (default: ',')
  • include_header: include the header row (default: True)
  • quoting: quote_all - all values are quoted, quote_needed - quote only values that need quoting (default: quote_needed)

When quote_needed is selected, the Python csv writer quotes all non-numeric values. For the pyarrow csv writer, the exact behavior is not described in its documentation; we observed that in some cases strings are not quoted either.
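
The difference between the two modes can be seen with the standard library writer. This sketch assumes that quote_all corresponds to csv.QUOTE_ALL and quote_needed to csv.QUOTE_NONNUMERIC. That mapping matches the behavior described above but is not taken from dlt's source, and the helper name is hypothetical.

```python
import csv
import io


def render(row, quoting):
    """Render a single row to one csv line using the given quoting mode."""
    buffer = io.StringIO()
    csv.writer(buffer, quoting=quoting, lineterminator="\n").writerow(row)
    return buffer.getvalue()


row = [1, 2.5, "text", None]

quoted_all = render(row, csv.QUOTE_ALL)            # every value is quoted
quoted_needed = render(row, csv.QUOTE_NONNUMERIC)  # numbers stay unquoted

print(quoted_all, end="")
print(quoted_needed, end="")
```

In both modes the standard library writes None as a quoted empty string.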

To change these settings in config.toml:

[normalize.data_writer]
delimiter="|"
include_header=false
quoting="quote_all"

Or using environment variables:

NORMALIZE__DATA_WRITER__DELIMITER=|
NORMALIZE__DATA_WRITER__INCLUDE_HEADER=False
NORMALIZE__DATA_WRITER__QUOTING=quote_all
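
Files written with these settings have no header row, so a consumer has to supply the column names itself. Below is a minimal sketch of reading such a file with the standard library; the sample content, the column names, and the helper name are made up for illustration.

```python
import csv
import io

# Hypothetical file content for delimiter="|", include_header=false, quoting="quote_all"
content = '"1"|"alice"|"2024-01-02"\n"2"|"bob|smith"|""\n'
columns = ["id", "name", "day"]  # assumed column names, not stored in the file


def read_rows(text, fieldnames):
    """Parse pipe-delimited csv text without a header row into dicts."""
    reader = csv.DictReader(io.StringIO(text), fieldnames=fieldnames, delimiter="|")
    return list(reader)


rows = read_rows(content, columns)
print(rows)
```

Every value comes back as a string, a delimiter inside a quoted value is preserved, and a NULL shows up as an empty string.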

Limitations

pyarrow csv writer

  • binary columns are supported only if they contain valid UTF-8 characters
  • complex (nested, struct) types are not supported

python stdlib csv writer

  • binary columns are supported only if they contain valid UTF-8 characters (support for more encodings is easy to add)
  • complex columns are dumped with json.dumps
  • None values are always quoted
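
The first two rules for the Python writer can be approximated as a conversion step applied to each value before a row is written. This is a sketch of the described behavior, not dlt's implementation, and the function name is hypothetical.

```python
import json


def to_csv_value(value):
    """Convert a Python value into something the stdlib csv writer can emit."""
    if isinstance(value, (bytes, bytearray)):
        # raises UnicodeDecodeError when the bytes are not valid UTF-8
        return bytes(value).decode("utf-8")
    if isinstance(value, (dict, list)):
        # complex (nested) values are serialized as JSON text
        return json.dumps(value)
    return value


print(to_csv_value(b"hello"))
print(to_csv_value({"a": [1, 2]}))
```

Binary values that are not valid UTF-8 make the decode step fail, which is why such columns are not supported.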
