Calculating data on ingestion with Materialized Columns


Materialized columns are a special type of column that is calculated at ingestion time. This means they are not present in the imported CSV file, but are generated by a transformation using any of the built-in functions in ClickHouse.

Materialized columns save you from re-running costly ETL processes that can take hours on your end. Instead, you can calculate or transform your data on the fly as it is ingested into a Data Source.

How to use MATERIALIZED columns

Let's see an example. In the Intro to ingesting data guide, we created a Data Source with this schema:

Then, in the Intro to transforming data guide, we created a Pipe named ``ecommerce_example`` where we did some transformations to the original data. For example, we extracted a couple of fields from a JSON column in the ``events`` Data Source:

Extracting some data from a JSON field

This operation alone takes almost 1.5 seconds. If it's done at query time, it adds that latency every time the endpoint is called, which isn't acceptable if the endpoint is going to be called many times or we need low latency - which we'll assume is the case for this guide.
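As a sketch of what that extraction might look like, ClickHouse's JSON functions can pull typed values out of a raw JSON string. Note that the ``extra_data`` column name and the JSON keys here are assumptions for illustration, not the guide's exact schema:

```sql
-- Hypothetical example: extract typed fields from a raw JSON column.
-- `extra_data` and the JSON keys are assumed names.
SELECT
    JSONExtractFloat(extra_data, 'price') AS price,
    JSONExtractString(extra_data, 'city') AS city
FROM events
```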

We also extracted the date from a ``DateTime`` field in another node of the Pipe:

And this operation also takes more than 1 second.
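That extraction can be done with ``toDate``, which truncates a ``DateTime`` to its calendar date (the column name here is illustrative):

```sql
-- Hypothetical example: derive a Date from a DateTime column
SELECT toDate(date) AS day
FROM events
```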

To make our queries as fast as possible, we should avoid doing expensive calculations at query time. A better option is to pre-calculate those fields and store them along with the rest of the columns on disk. Our data will take up more space on disk, but if we're building a real-time application that needs to return data with low latency, this is a necessary trade-off. And this is exactly what materialized columns allow us to do.

The solution

We'll define a new Data Source with three materialized columns to avoid doing those computations at query time and therefore speed up our queries.
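In plain ClickHouse DDL, a table with those three materialized columns could be declared as follows. This is a sketch: the base column names and the engine settings are assumptions, not the guide's exact schema.

```sql
-- Hypothetical sketch of the new Data Source's schema.
-- Base columns (date, product_id, user_id, event, extra_data) are assumed names.
CREATE TABLE events_mat
(
    date DateTime,
    product_id String,
    user_id String,
    event String,
    extra_data String,
    -- Calculated once at ingestion time and stored on disk:
    price Float32 MATERIALIZED JSONExtractFloat(extra_data, 'price'),
    city String MATERIALIZED JSONExtractString(extra_data, 'city'),
    day Date MATERIALIZED toDate(date)
)
ENGINE = MergeTree
ORDER BY date
```

Because the expressions run at ingestion time, every query that reads ``price``, ``city`` or ``day`` afterwards is a plain column read rather than a JSON parse or a cast.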

And then we reingest all the data. Remember that we're ingesting 100M rows and that each of these two CSVs takes up ~5GB.

From now on, you can use the new ``price``, ``city`` and ``day`` materialized columns in your analyses.

{% tip-box title="Materialized columns need to be selected explicitly" %}If you run ``SELECT * FROM events_mat``, the materialized columns won't show up. You need to add their names explicitly to the ``SELECT`` query. They also won't show up when you download the schema of the Data Source via the UI or by running ``tb pull``, but they're there.{% tip-box-end %}
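For instance, assuming the base column names used earlier in this guide for illustration, querying the materialized columns looks like this (``SELECT *`` would return only the non-materialized columns):

```sql
-- Materialized columns must be named explicitly in the SELECT list
SELECT date, product_id, price, city, day
FROM events_mat
LIMIT 10
```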

The {% code-line %}MATERIALIZED{% code-line-end %} expression allows for many transformations (see the built-in functions for a complete list), but the most common ones are:

  • Casting to a different data type (toFloat, toInt, toDate, ...)
  • Concatenating several fields
  • Taking a substring of a field
  • Parsing a JSON object, a date, or a wrongly formatted column
  • Indexing data using a geospatial function or another type of hash

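A hedged sketch of each of those patterns, using ClickHouse built-ins; all table and column names here are hypothetical:

```sql
SELECT
    toFloat64(amount_str)                 AS cast_example,       -- casting to a different type
    concat(first_name, ' ', last_name)    AS concat_example,     -- concatenating fields
    substring(phone, 1, 3)                AS substring_example,  -- substring of a field
    JSONExtractString(payload, 'country') AS json_example,       -- parsing a JSON object
    geohashEncode(longitude, latitude, 7) AS geo_example         -- geospatial indexing
FROM some_table
```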
Note that the {% code-line %}MATERIALIZED{% code-line-end %} expression has to be defined in the {% code-line %}schema{% code-line-end %} definition of a Data Source, so it has to be set at Data Source creation time.

See the API reference to learn how to create a Data Source with a schema using the Data Source API.

Alternatively, check the CLI documentation to learn how to define a Data Source schema and push it to your Tinybird account.

{% tip-box title="Materialized columns creation and visibility" %}Materialized columns cannot be created from the UI, nor are they shown in the Data Source preview.{% tip-box-end %}