Example of Materialized View (CLI)

Consider an events Data Source that stores, for each action performed on an ecommerce website, a timestamp, the user who performed the action, the product, the type of action (buy, add to cart, view, and so on), and a JSON column containing some metadata, such as the price.

The events Data Source is expected to store billions of rows per month. Its data schema is as follows:

DEFINITION OF THE EVENTS.DATASOURCE FILE
SCHEMA >
    `date` DateTime,
    `product_id` String,
    `user_id` Int64,
    `event` String,
    `extra_data` String

ENGINE "MergeTree"
ENGINE_PARTITION_KEY "toYear(date)"
ENGINE_SORTING_KEY "date, cityHash64(extra_data)"
ENGINE_SAMPLING_KEY "cityHash64(extra_data)"
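
For reference, a couple of illustrative rows and how they could be appended with the CLI. The values and the sample_events.csv file name are hypothetical; only the extra_data column's JSON shape matters for the rest of the example:

EXAMPLE ROWS AND HOW TO APPEND THEM
"2024-05-12 10:32:07","kb-1234",423178,"buy","{""price"": 29.99}"
"2024-05-12 10:35:41","ms-5678",881223,"view","{""price"": 15.50}"

tb datasource append events sample_events.csv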

You want to publish an API Endpoint that returns, for a date range, the top 10 products ranked by total amount sold. Here's where Materialized Views can help you.
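
Without a Materialized View, every request to that endpoint would have to scan and aggregate the raw events. A sketch of that query, with an illustrative date range, looks like this:

QUERY OVER THE RAW EVENTS, WITHOUT A MATERIALIZED VIEW
SELECT
    toDate(date) AS day,
    topK(10)(product_id) AS top_10,
    sum(JSONExtractFloat(extra_data, 'price')) AS total_sales
FROM events
WHERE event = 'buy'
    AND toDate(date) BETWEEN '2024-01-01' AND '2024-01-31'
GROUP BY day

On billions of rows per month, that scan gets slower and more expensive as data grows, which is exactly what materializing the aggregation avoids.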

Materialize the results

After defining the desired transformations, set the TYPE parameter to materialized and add the name of the destination Data Source that stores the materialized results.

DEFINITION OF THE TOP_PRODUCT_PER_DAY.PIPE FILE
NODE only_buy_events
DESCRIPTION >
    filters all the buy events

SQL >
    SELECT
        toDate(date) AS date,
        product_id,
        JSONExtractFloat(extra_data, 'price') AS price
    FROM events
    WHERE event = 'buy'

NODE top_per_day
SQL >
    SELECT
        date,
        topKState(10)(product_id) AS top_10,
        sumState(price) AS total_sales
    FROM only_buy_events
    GROUP BY date

TYPE materialized
DATASOURCE top_products_view

Define the rest in the schema of the Materialized View's destination Data Source, top_products_view:

DEFINITION OF THE TOP_PRODUCTS_VIEW.DATASOURCE FILE
SCHEMA >
    `date` Date,
    `top_10` AggregateFunction(topK(10), String),
    `total_sales` AggregateFunction(sum, Float64)

ENGINE "AggregatingMergeTree"
ENGINE_SORTING_KEY "date"

The destination Data Source uses an AggregatingMergeTree engine, which stores, for each date, the corresponding aggregation states for the top 10 products and the total sales.

Having the data precalculated as it's ingested lets the API Endpoint respond in real time, regardless of the number of rows in the events Data Source.
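
Note that querying the Materialized View directly returns intermediate binary states in the AggregateFunction columns, not final values. As a debugging sketch, ClickHouse's finalizeAggregation function can turn a single state into a readable value; keep in mind a date may still be split across several partially aggregated rows until parts are merged, which is why the endpoint uses the -Merge combinators with GROUP BY:

INSPECTING THE STATES STORED IN THE MATERIALIZED VIEW
SELECT
    date,
    finalizeAggregation(top_10) AS top_10_sample,
    finalizeAggregation(total_sales) AS total_sales_sample
FROM top_products_view
ORDER BY date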

The Pipe used to build the API Endpoint, top_products_endpoint, is as follows:

DEFINITION OF THE TOP_PRODUCTS_ENDPOINT.PIPE FILE
NODE top_products_day
SQL >
    SELECT
        date,
        topKMerge(10)(top_10) AS top_10,
        sumMerge(total_sales) AS total_sales
    FROM top_products_view
    GROUP BY date
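
Since the endpoint should serve a date range, you can also filter the node with query parameters using Tinybird's templating syntax. A minimal sketch, where the start_date and end_date parameter names and their defaults are illustrative:

ENDPOINT NODE WITH ILLUSTRATIVE DATE RANGE PARAMETERS
NODE top_products_day
SQL >
    %
    SELECT
        date,
        topKMerge(10)(top_10) AS top_10,
        sumMerge(total_sales) AS total_sales
    FROM top_products_view
    WHERE date BETWEEN {{Date(start_date, '2024-01-01')}} AND {{Date(end_date, '2024-01-31')}}
    GROUP BY date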

When preaggregating, the Aggregate Functions use the -State modifier (topKState, sumState); when reading the result, the matching -Merge modifier (topKMerge, sumMerge) combines the intermediate states.
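
The same pattern can be reproduced in a single ClickHouse query, which may help to see how the two halves fit together; this sketch builds intermediate sum states per day and then merges them into one total:

ILLUSTRATION OF THE STATE/MERGE PATTERN
SELECT sumMerge(partial_sales) AS total_sales
FROM
(
    SELECT
        toDate(date) AS day,
        sumState(JSONExtractFloat(extra_data, 'price')) AS partial_sales
    FROM events
    WHERE event = 'buy'
    GROUP BY day
)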

Push to Tinybird

Once everything is defined, push it to your Tinybird account:

PUSH YOUR PIPES AND DATA SOURCES USING THE CLI
tb push datasources/top_products_view.datasource
tb push pipes/top_product_per_day.pipe --populate
tb push endpoints/top_products_endpoint.pipe

When pushing top_product_per_day.pipe, use the --populate flag. This runs the transformation as a job and populates the Materialized View top_products_view with the data already in the events Data Source.
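
After pushing the endpoint Pipe, you can query the published API over HTTP. A sketch assuming the default api.tinybird.co host, a read Token exported as TB_TOKEN, and the illustrative date parameters from the earlier sketch (omit them if the endpoint isn't parameterized):

QUERY THE PUBLISHED API ENDPOINT
curl "https://api.tinybird.co/v0/pipes/top_products_endpoint.json?token=$TB_TOKEN&start_date=2024-01-01&end_date=2024-01-31"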

You can repopulate Materialized Views at any moment:

COMMAND TO FORCE POPULATE THE MATERIALIZED VIEW
tb push pipes/top_product_per_day.pipe --populate --force

Using On-Demand Compute for Populates

For large datasets, or when your Workspace is under heavy load, you can use dedicated compute instances for populate operations by adding the --on-demand-compute flag:

POPULATE USING ON-DEMAND COMPUTE
tb push pipes/top_product_per_day.pipe --populate --on-demand-compute

FORCE POPULATE USING ON-DEMAND COMPUTE
tb push pipes/top_product_per_day.pipe --populate --force --on-demand-compute

This runs the populate operation on isolated compute resources, eliminating resource contention with your live queries and ensuring better performance for both the populate job and your production workloads. See Compute-Compute Separation for Populates for more details on pricing and benefits.
