Using Quarantine Data Sources

In this guide you'll learn the basics about the Quarantine Data Source, and how to use it to detect and fix errors on any of your Data Sources. Both in development and production environments.

The quarantine Data Source

Every data source you create in your account has a quarantine Data Source associated. When the Data Sources API finds rows in your CSV that don't fit the data source schema, they are automatically sent to the quarantine table. This way, the whole ingestion process doesn't fail, and you can review rows quarantined later or perform operations on them in real-time using Pipes. This is a great source of information for you to fix your origin ETL, or a very powerful way to do the needed changes on-the-fly during the ingestion process.

By convention, the quarantine Data Source is named {% code-line %}{datasource_name}_quarantine{% code-line-end %}.

{% tip-box title="AUTH TOKENS OVER QUARANTINE DATA SOURCES" %}Adding a particular Data Source scope to an Auth Token, means giving the same permissions to its associated quarantine Data Source. See the Auth Token API for more information about authorization.{% tip-box-end %}

A practical example

For this guide, you can use events Data Source mentioned in our Getting Started guides or just create a new one using this CSV file. Anyhow, you will end up with a Data Source pretty similar to this:

Data Source details can be accessed from your Sidebar

Under the hood, the Data Source schema would be auto-guessed and defined as follows:

Now, we will do an incremental update by appending a few more rows with a new CSV file to our Data Source. Some rows on the file contains wrong data, so it fails.

Quarantine Data Source is always accessible (if it contains any rows) from the Data Source modal window

{% tip-box title="CHECK OUT THE OPERATIONS LOG TAB" %}Within the Data Source modal window you can also find the Operations Log, where you can see details about any operation performed over that Data Source.{% tip-box-end %}

As the UI shows, we have 4 rows in quarantine for a variety of reasons. Let's take a look at the detail by looking at the quarantine table and see how to fix those.

Within the Quarantine view you can see both, a summary of errors and the rows that have failed

In the Error view you will see a summary of all the errors and the number of occurrences for each of those, so you can prioritize fixing the most common ones, for example. In the Row view, you will see all the rows that have failed so you can further investigate why.

Common fixes for rows in quarantine

As mentioned above, there are three ways of fixing quarantine errors. First, you can always use the Quarantine Data Source to detect potential problems when generating the data, and then fix those in the origin.

On other occasions, especially when you don't control the way data is generated, it is better to change the original schema to support new rows with different values, or if you can't do that easily, the best approach is to use Pipes to generate intermediate views of your data consolidating both rows that would go to quarantine and rows that go straight to your Data Source.

Changing the original Data Schema

As you can see in the Quarantine table, all the problems were due to the wrong data types. The simplest solution would be to modify the schema generation at ingest time and recreate the original Data Source. For the errors we have, we could have changed the schema using the CLI or the User Interface to something like:

As you can see:

  • By setting the date column to {% code-line %}String{% code-line-end %} we can accommodate dates in other formats that we could eventually fix,
  • and by changing the user_id column to {% code-line %}Nullable(String){% code-line-end %} we would support null values and non-numeric formats in our Data Source.

Fixing errors on-the-fly using Pipes and Materialized views

This is one of the most powerful capabilities of Tinybird. When for whatever reason you aren't able to influence the way data is generated, or you just can't recreate the Data Sources easily, you can apply a transformation to the erroring columns at ingestion time and materialize the results on another Data Source.

We've written two guides where you can learn more about materialized columns and transformation pipes. Check them out!

{% tip-box title="IT'S ALWAYS BETTER TO FIX DATA AT THE SOURCE" %}Data Sources with wrong rows that end up in the quarantine table take more time to ingest than data sources with no errors, so for an optimal ingestion performance it's better if you fix all the quarantine errors you find during the development phase of your endpoints.{% tip-box-end %}