Fine-tuning CSVs for fast ingestion


If you format your CSVs properly, ingestion can be up to 10x faster, which lets you import data sooner or ingest much more data in the same time. In this guide, we'll show you some small changes you can make to format your CSVs optimally, so that the import process is as fast as possible.

How our ingestion backend works

The ingestion process has two phases:

  1. In the first one, it tries to import your CSV files in blocks (of 32MB, for example), assuming that the format is valid.
  2. If that fails and a block can't be processed (because the types don't match, or some lines don't have the right number of fields), we re-process the block line by line, trying to solve those problems. The lines that can't be imported are sent to the Quarantine Data Source.

The first phase is much more performant (between 5x and 10x faster). It requires the CSV file to be formatted according to the RFC 4180 standard; if it's not, the file falls back to the second phase.

Once the Data Source schema has been created, we recommend omitting the header in subsequent requests for performance reasons. However, if the header is included and it contains all the names present in the Data Source schema, the ingestion will work even if the columns are in a different order than the one used when the Data Source was created.

How to format your CSVs so that ingestions are fast

Use the right escape character

You should not use the backslash (``\``) as an escape character; use two double quotes (``""``) instead. So, for example, instead of formatting JSON fields like this:

  "{\"city\": \"Madrid\", \"temp\": 21}"

you'd escape the inner quotes with two double quotes, like this:

  "{""city"": ""Madrid"", ""temp"": 21}"
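As an illustration, Python's ``csv`` module produces this RFC 4180 escaping by default (the JSON payload below is invented for the example):

```python
import csv
import io
import json

# A row whose second field is a JSON document containing double quotes.
# The field names and values are invented for illustration.
payload = json.dumps({"city": "Madrid", "temp": 21})

buf = io.StringIO()
# The default csv dialect uses doublequote=True: every embedded " becomes "",
# with no backslashes, as RFC 4180 requires.
writer = csv.writer(buf)
writer.writerow([1, payload])

print(buf.getvalue())
# 1,"{""city"": ""Madrid"", ""temp"": 21}"
```

Files written this way are processed in the fast first phase, since no backslash escapes appear anywhere in the output.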

Enclose multi-line strings in double quotes

You should also enclose multi-line string values in double quotes (``"``) to comply with the RFC 4180 standard.

So if you have a string field that can contain line breaks, like this:

  first,this is a value
  that spans two lines,last

you should enclose it in double quotes instead, like this:

  first,"this is a value
  that spans two lines",last
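A quick way to check this quoting is Python's ``csv`` writer, which adds the enclosing double quotes automatically whenever a value contains a line break:

```python
import csv
import io

buf = io.StringIO()
writer = csv.writer(buf)
# The middle value contains a line break, so the writer encloses it in
# double quotes, as RFC 4180 requires.
writer.writerow(["first", "a value that\nspans two lines", "last"])

print(buf.getvalue())
# first,"a value that
# spans two lines",last
```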

Other performance tips

These tips will also make your imports faster:

  1. Do not add a header row to the CSV
  2. Format ``DateTime`` columns like ``YYYY-MM-DD HH:MM:SS`` and ``Date`` columns like ``YYYY-MM-DD``
  3. Send CSVs that are sorted by the same sorting keys as the destination Data Source
  4. Send the encoding in the ``charset`` part of the ``Content-Type`` header if it's different from UTF-8, or just use UTF-8. It should look like this: ``Content-Type: text/csv; charset=utf-8``
  5. If you use a different line-break character, set it in the ``dialect_new_line`` query string parameter of the URL
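Tips 1-3 can be sketched with Python's ``csv`` module; the column layout and the sorting key below are invented for illustration:

```python
import csv
import io
from datetime import datetime

# Hypothetical rows: (timestamp, user_id, amount)
rows = [
    (datetime(2023, 5, 17, 9, 30, 0), 42, 19.9),
    (datetime(2023, 5, 17, 9, 29, 0), 7, 5.0),
]

# Tip 3: pre-sort by the destination Data Source's sorting key
# (assumed here to be the timestamp column)
rows.sort(key=lambda r: r[0])

buf = io.StringIO()
writer = csv.writer(buf)
for ts, user_id, amount in rows:  # Tip 1: no header row is written
    # Tip 2: DateTime formatted as YYYY-MM-DD HH:MM:SS
    writer.writerow([ts.strftime("%Y-%m-%d %H:%M:%S"), user_id, amount])

print(buf.getvalue())
# 2023-05-17 09:29:00,7,5.0
# 2023-05-17 09:30:00,42,19.9
```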

A ~100MB file that doesn't follow these guidelines takes 11-12 seconds to be ingested into Tinybird, whereas the same file formatted according to these tips is ingested in only about 1-2 seconds.