Intro to transforming data with Tinybird

Tinybird transforms your data on the fly, either at ingestion time or at request time. This guide summarizes how to do that using Pipes and SQL queries.

If you've followed the Intro to ingesting data guide, you should have two Data Sources already in your account. Now you're ready to create Pipes and start exploring and transforming your raw data.

What's a Pipe?

If you've used Jupyter Notebooks, Pipes will feel familiar. A Pipe is a series of transformation nodes where you can work with your data using SQL and see the results iteratively, in real time.

Each Pipe can have multiple transformation nodes, which behave similarly to cells in a notebook. Nodes in a Pipe can refer to the results of previous nodes (think of them as subqueries) or query raw Data Sources directly.
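As a sketch of how node chaining works (the Data Source name, column names, and node name here are hypothetical): a first node filters a raw Data Source, and a second node aggregates over the first node's result by referencing it by name, as if it were a table.

```sql
-- Node 1, named "filtered_events": keep only purchase events
SELECT * FROM raw_events WHERE event = 'buy'

-- Node 2: query the previous node by its name
SELECT toDate(timestamp) AS day, count() AS purchases
FROM filtered_events
GROUP BY day
```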

Pipes are a great way to keep your queries clean and short, so they're easy to maintain and debug. They also make working with (big) data smoother, helping you spot the parts of a query that perform worst and, ultimately, making the whole process much more enjoyable.

You can create new Pipes from the Data Source modal window, the sidebar, or your Dashboard view.

Click the + button in the Pipes section of the sidebar to create a new Pipe, and give it a name like {% code-line %}ecommerce_example{% code-line-end %}.

{% tip-box title="GETTING STARTED PIPE" %}By default, you'll have a Getting Started Pipe in your account where we introduce you to some analyses you can run on the NYC Taxi Trip Duration dataset using Tinybird, and to some of the functions available for you to use. You can edit the code in every node, run it, and see the results immediately.{% tip-box-end %}

Exploring data with the UI

Most of the SQL syntax you already know also works in Tinybird. To get started, create an empty Pipe by clicking the small "+" button in the sidebar and rename it to "Ecommerce". Then, write a first query to explore what's in the {% code-line %}products{% code-line-end %} Data Source.
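A minimal first query is enough to peek at the data; a sketch like this works, though you may want a different row limit:

```sql
SELECT * FROM products LIMIT 10
```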

Clicking the run button shows the results of your query below, together with some very useful information: the rows processed, the memory used, and the duration of the query.

Apart from the query performance information, you can see quick actions for your transformation node right below your query results.

Now, let's create another transformation node and do exactly the same with the {% code-line %}events{% code-line-end %} Data Source.
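The query is the same shape as before, just pointed at the other Data Source:

```sql
SELECT * FROM events LIMIT 10
```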

Under normal circumstances, querying both Data Sources takes less than 100ms. Given how fast it is, we recommend spending a while running some simple filter or aggregation queries to get a better feel for the shape of your data, such as looking at the number of events per day:
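A sketch of that daily aggregation, assuming the {% code-line %}events{% code-line-end %} Data Source has a DateTime column named {% code-line %}timestamp{% code-line-end %} (your column may be named differently):

```sql
SELECT toDate(timestamp) AS day, count() AS total_events
FROM events
GROUP BY day
ORDER BY day
```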

or ranking the different user_ids by number of events:
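The ranking query could look like this, assuming a {% code-line %}user_id{% code-line-end %} column:

```sql
SELECT user_id, count() AS total_events
FROM events
GROUP BY user_id
ORDER BY total_events DESC
LIMIT 10
```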

Joining Data Sources

Joining Data Sources at scale is one of the most challenging operations in analytical platforms, and Tinybird excels at that. Let's now enrich the events data with info about the product related to each event by performing a simple JOIN.
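A sketch of that enrichment join; the column names ({% code-line %}product_id{% code-line-end %}, {% code-line %}title{% code-line-end %}, {% code-line %}price{% code-line-end %}) are assumptions about the example schema:

```sql
SELECT e.timestamp, e.user_id, p.title, p.price
FROM events AS e
JOIN products AS p ON e.product_id = p.product_id
```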

As you can see, joining 100M events with 2M products takes around a second or two in Tinybird. To make this query about 100X faster, learn how to use the Join engine in this guide.

{% tip-box title="OTHER TYPES OF JOINS" %}ClickHouse supports all standard SQL JOIN types.{% tip-box-end %}

Calculated columns

ClickHouse comes with lots of powerful functions that let you transform data in multiple ways. With Tinybird, you can use them all, so we encourage you to check out the ClickHouse docs for everything that's available. Let's look at a few of them below.

JSON functions

As you might have seen, the {% code-line %}events{% code-line-end %} Data Source has a column containing JSON data for locations and prices. We can easily extract the price and city values into their own columns using the {% code-line %}JSONExtract{% code-line-end %} functions.
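For instance, assuming the JSON column is named {% code-line %}json{% code-line-end %} (it may have a different name in your schema):

```sql
SELECT
  JSONExtractFloat(json, 'price') AS price,
  JSONExtractString(json, 'city') AS city
FROM events
```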

{% tip-box title="WORKING WITH JSON COLUMNS" %}ClickHouse comes with a set of functions to work with JSON data, which is especially useful when working with Data Streams or events data.{% tip-box-end %}

Date functions

ClickHouse also comes with an extensive list of functions to transform dates and times. As you've seen above, extracting the date part from a DateTime column is easy and fast, and very useful when filtering by time:
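A sketch of a time-filtered aggregation, again assuming a DateTime column named {% code-line %}timestamp{% code-line-end %}:

```sql
SELECT toDate(timestamp) AS day, count() AS total_events
FROM events
WHERE timestamp >= now() - INTERVAL 7 DAY
GROUP BY day
ORDER BY day
```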

Using materialized columns

Operations like these can be done on the fly with decent performance, as you can see, but it may be wise to use Materialized columns so that the transformation happens only once, at ingestion time, and the results are available as soon as your data hits Tinybird. Check out the guide on Materialized columns, and also the one on Transformation Pipes, to learn more about these more advanced techniques.