Syncing data with cron jobs or GitHub Actions

Cron jobs are a universal way of scheduling tasks. This guide shows how to keep your data in sync using cron jobs or GitHub Actions and the Tinybird REST API.

Let's assume you have already imported a Data Source into your Tinybird account and that you have properly defined its schema and its partition key. Once everything is set, you can use the Data Source API to periodically append to or replace data in your Data Sources. This guide walks through some examples.

Using crontab

Crontab is a native UNIX tool that schedules the execution of commands at a specified time or time interval. You define the schedule and the command to execute in a text file, which you can usually edit with {% code-line %}sudo crontab -e{% code-line-end %}. You can learn more about using crontab in many places on the internet.

The cron table format

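Each line of a cron table has five time fields (minute, hour, day of month, month and day of week) followed by the command to execute. You can also use external tools that help you define cron job schedules.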

These would be typical cron schedules to execute a command:

  • Every five minutes: {% code-line %}*/5 * * * *{% code-line-end %}
  • Every day at midnight: {% code-line %}0 0 * * *{% code-line-end %}
  • First day of every month at midnight: {% code-line %}0 0 1 * *{% code-line-end %}
  • Every Sunday at midnight: {% code-line %}0 0 * * 0{% code-line-end %}
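To make the field order concrete, here is a minimal shell sketch that splits a schedule into its five time fields and labels each one. The function name ``describe_cron`` is hypothetical and exists only for illustration; it is not part of cron.

```shell
#!/bin/sh
# Split a cron schedule into its five time fields and label each one.
# describe_cron is a hypothetical helper for illustration, not a cron tool.
describe_cron() {
  printf '%s\n' "$1" | {
    # read splits on whitespace and does not expand the "*" characters
    read -r minute hour day_of_month month day_of_week
    printf 'minute: %s\n' "$minute"
    printf 'hour: %s\n' "$hour"
    printf 'day of month: %s\n' "$day_of_month"
    printf 'month: %s\n' "$month"
    printf 'day of week: %s\n' "$day_of_week"
  }
}

# Every day at 00:10
describe_cron '10 0 * * *'
```

For this schedule the sketch reports minute 10 and hour 0, with ``*`` (any value) for the remaining three fields.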

Appending Data periodically

It's very common to have a Data Source that grows over time, such as the one used in previous guides for e-commerce events. In these cases, there is often an ETL process that extracts this data from the transactional database and generates CSV files with the last X hours or days of data, so you might want to append those recently generated rows to your Tinybird Data Source. Imagine you generate new CSV files every day at 00:00 and want to append them to Tinybird every day at 00:10.

With a shell script

First, create a shell script file containing the Tinybird API request.
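A minimal sketch of such a script could look like this. The file name ``append.sh``, the Data Source name ``events``, the CSV URL and the API host are placeholders assumed for illustration, and ``TOKEN`` is expected to be available in the environment cron runs the script with.

```shell
#!/bin/sh
# append.sh: ask Tinybird to append the rows of a remote CSV file to a Data Source.
set -eu

# Placeholder values (assumptions); override them through the environment.
TB_API="${TB_API:-https://api.tinybird.co}"
DATASOURCE="${DATASOURCE:-events}"
CSV_URL="${CSV_URL:-https://example.com/events.csv}"

# Build the Data Sources API endpoint for the given import mode.
datasource_endpoint() {
  printf '%s/v0/datasources?name=%s&mode=%s' "$TB_API" "$DATASOURCE" "$1"
}

# POST the CSV URL; Tinybird fetches the file and appends its rows.
append_csv() {
  curl --fail --silent --show-error \
    -H "Authorization: Bearer ${TOKEN:?set TOKEN to your Tinybird token}" \
    -X POST "$(datasource_endpoint append)" \
    --data-urlencode "url=${CSV_URL}"
}

# Call the API only when the file runs under its own name (as cron would run it),
# so the functions can be sourced or tested without touching the network.
case "$0" in
  *append.sh) append_csv ;;
esac
```

The ``--fail`` flag makes curl exit with a non-zero status on an HTTP error, which makes failures visible in whatever cron reports for the job.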

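Then add a new line to your crontab file to run the script every day at 00:10, using the schedule {% code-line %}10 0 * * *{% code-line-end %} followed by the path to your script.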

{% tip-box title="BASICS ABOUT CRONTAB" %}Type {% code-line %}sudo crontab -e{% code-line-end %} in your terminal to start adding your cronjobs.{% tip-box-end %}

Using GitHub Actions

If your project is hosted on GitHub, you can also use GitHub Actions to schedule periodic jobs. Create a new file called ``.github/workflows/append.yml`` containing a workflow that appends data from a CSV file, given its URL, every day at 00:10.
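One way to create that file is the shell sketch below, run from the root of your repository; the YAML between the heredoc markers is the workflow itself. The workflow name, the ``events`` Data Source, the CSV URL and the API host are placeholders assumed for illustration.

```shell
#!/bin/sh
# Create the append workflow; run this from the root of your repository.
set -eu

mkdir -p .github/workflows

# The quoted 'YAML' delimiter keeps ${{ ... }} and $VARS literal in the file.
cat > .github/workflows/append.yml <<'YAML'
name: Append data to Tinybird

on:
  schedule:
    # Every day at 00:10 (GitHub evaluates schedules in UTC)
    - cron: '10 0 * * *'

jobs:
  append:
    runs-on: ubuntu-latest
    steps:
      - name: Append CSV to the Data Source
        env:
          TOKEN: ${{ secrets.TOKEN }}
          DATASOURCE: events                       # placeholder
          CSV_URL: https://example.com/events.csv  # placeholder
        run: |
          curl --fail --silent --show-error \
            -H "Authorization: Bearer $TOKEN" \
            -X POST "https://api.tinybird.co/v0/datasources?name=$DATASOURCE&mode=append" \
            --data-urlencode "url=$CSV_URL"
YAML
```

Commit the generated file to your default branch so that the schedule takes effect.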

Replacing Data periodically

Think again about your events Data Source, but now imagine a scenario where you want to replace the whole Data Source with a CSV file located at a publicly accessible URL on the first day of every month.

With a shell script
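A minimal sketch of a replace script, using the ``replace`` mode of the Data Source API. The file name ``replace.sh``, the Data Source name, the CSV URL and the API host are placeholders assumed for illustration.

```shell
#!/bin/sh
# replace.sh: replace the full contents of a Data Source with a remote CSV file.
set -eu

# Placeholder values (assumptions); override them through the environment.
TB_API="${TB_API:-https://api.tinybird.co}"
DATASOURCE="${DATASOURCE:-events}"
CSV_URL="${CSV_URL:-https://example.com/events.csv}"

# mode=replace swaps the existing rows for the rows in the CSV.
replace_endpoint() {
  printf '%s/v0/datasources?name=%s&mode=replace' "$TB_API" "$DATASOURCE"
}

# POST the CSV URL; Tinybird fetches the file and replaces the data with it.
replace_csv() {
  curl --fail --silent --show-error \
    -H "Authorization: Bearer ${TOKEN:?set TOKEN to your Tinybird token}" \
    -X POST "$(replace_endpoint)" \
    --data-urlencode "url=${CSV_URL}"
}

# Call the API only when the file runs under its own name (as cron would run it).
case "$0" in
  *replace.sh) replace_csv ;;
esac
```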

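Once your script is saved, edit the crontab file, which takes care of executing it periodically. You can do this by typing {% code-line %}sudo crontab -e{% code-line-end %} in your terminal and adding a line that uses the schedule {% code-line %}0 0 1 * *{% code-line-end %} (the first day of every month at midnight) followed by the path to your script.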

{% tip-box title="BE SURE YOU SAVE YOUR SCRIPTS IN THE RIGHT LOCATION" %}To be sure it works, save your shell scripts in the {% code-line %}/opt/cronjobs/{% code-line-end %} folder.{% tip-box-end %}

With GitHub Actions

Create a new file called ``.github/workflows/replace.yml`` containing a workflow that replaces all your data with the contents of the CSV at a given URL on the first day of every month at 00:10.
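The shell sketch below, run from the root of your repository, writes one possible version of that workflow. The schedule follows the monthly scenario described for this section (change it to ``10 0 * * *`` for a daily run), and the workflow name, Data Source name, CSV URL and API host are placeholders assumed for illustration.

```shell
#!/bin/sh
# Create the replace workflow; run this from the root of your repository.
set -eu

mkdir -p .github/workflows

# The quoted 'YAML' delimiter keeps ${{ ... }} and $VARS literal in the file.
cat > .github/workflows/replace.yml <<'YAML'
name: Replace data in Tinybird

on:
  schedule:
    # First day of every month at 00:10 (GitHub evaluates schedules in UTC)
    - cron: '10 0 1 * *'

jobs:
  replace:
    runs-on: ubuntu-latest
    steps:
      - name: Replace the Data Source with the CSV
        env:
          TOKEN: ${{ secrets.TOKEN }}
          DATASOURCE: events                       # placeholder
          CSV_URL: https://example.com/events.csv  # placeholder
        run: |
          curl --fail --silent --show-error \
            -H "Authorization: Bearer $TOKEN" \
            -X POST "https://api.tinybird.co/v0/datasources?name=$DATASOURCE&mode=replace" \
            --data-urlencode "url=$CSV_URL"
YAML
```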

Replacing just one month of data

Having your API call inside a shell script allows you to script more complex ingestion processes. For example, imagine that you want to replace the last month of events data every day. Each day, you would export a CSV file to a publicly accessible URL and name it something like {% code-line %}events_YYYY-MM-DD.csv{% code-line-end %}.

With a shell script

To do so, you could script a process that performs a conditional data replacement, so that only the rows matching a condition are replaced.
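A minimal sketch of such a process, using the ``replace_condition`` parameter of the Data Source API together with ``mode=replace``. The date column name ``date``, the base URL of the exports, the Data Source name and the API host are hypothetical placeholders, and the one-month-ago calculation assumes GNU ``date``.

```shell
#!/bin/sh
# daily_replace.sh: replace only the last month of rows in a Data Source.
set -eu

# Placeholder values (assumptions); override them through the environment.
TB_API="${TB_API:-https://api.tinybird.co}"
DATASOURCE="${DATASOURCE:-events}"
CSV_BASE_URL="${CSV_BASE_URL:-https://example.com/exports}"
DATE_COLUMN="${DATE_COLUMN:-date}"   # hypothetical column holding the event date

# URL of the export generated for a given day (YYYY-MM-DD).
csv_url_for() {
  printf '%s/events_%s.csv' "$CSV_BASE_URL" "$1"
}

# Condition selecting the rows to replace, in query-string form
# ("+" stands for a space).
replace_condition() {
  printf "(%s+BETWEEN+'%s'+AND+'%s')" "$DATE_COLUMN" "$1" "$2"
}

replace_last_month() {
  today="$(date +%Y-%m-%d)"
  # GNU date syntax; on macOS/BSD use: date -v-1m +%Y-%m-%d
  one_month_ago="$(date -d '1 month ago' +%Y-%m-%d)"
  curl --fail --silent --show-error \
    -H "Authorization: Bearer ${TOKEN:?set TOKEN to your Tinybird token}" \
    -X POST "${TB_API}/v0/datasources?name=${DATASOURCE}&mode=replace&replace_condition=$(replace_condition "$one_month_ago" "$today")" \
    --data-urlencode "url=$(csv_url_for "$today")"
}

# Call the API only when the file runs under its own name (as cron would run it).
case "$0" in
  *daily_replace.sh) replace_last_month ;;
esac
```

Only rows whose date falls inside the window are replaced; older rows in the Data Source are left untouched.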

Then, after saving that file to {% code-line %}/opt/cronjobs/daily_replace.sh{% code-line-end %}, add the following line to {% code-line %}crontab{% code-line-end %} to run it every day at midnight: {% code-line %}0 0 * * * /opt/cronjobs/daily_replace.sh{% code-line-end %}

With GitHub Actions

Create a new file called ``.github/workflows/replace_last_month.yml`` containing a workflow that replaces all the data for the last month every day at 00:10.
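The shell sketch below, run from the root of your repository, writes one possible version of that workflow. The date column name ``date``, the workflow name, the Data Source name, the CSV URL and the API host are hypothetical placeholders; the date arithmetic relies on GNU ``date``, which is available on the ``ubuntu-latest`` runner.

```shell
#!/bin/sh
# Create the last-month replace workflow; run this from the root of your repository.
set -eu

mkdir -p .github/workflows

# The quoted 'YAML' delimiter keeps ${{ ... }} and $VARS literal in the file.
cat > .github/workflows/replace_last_month.yml <<'YAML'
name: Replace last month of data in Tinybird

on:
  schedule:
    # Every day at 00:10 (GitHub evaluates schedules in UTC)
    - cron: '10 0 * * *'

jobs:
  replace_last_month:
    runs-on: ubuntu-latest
    steps:
      - name: Replace the last month of events
        env:
          TOKEN: ${{ secrets.TOKEN }}
          DATASOURCE: events                       # placeholder
          CSV_URL: https://example.com/events.csv  # placeholder
        run: |
          TODAY="$(date +%Y-%m-%d)"
          ONE_MONTH_AGO="$(date -d '1 month ago' +%Y-%m-%d)"
          # "date" below is a hypothetical column name; "+" stands for a space
          curl --fail --silent --show-error \
            -H "Authorization: Bearer $TOKEN" \
            -X POST "https://api.tinybird.co/v0/datasources?name=$DATASOURCE&mode=replace&replace_condition=(date+BETWEEN+'$ONE_MONTH_AGO'+AND+'$TODAY')" \
            --data-urlencode "url=$CSV_URL"
YAML
```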

{% tip-box title="Use GitHub secrets" %}Store ``TOKEN`` as an encrypted secret to avoid hardcoding secret keys in your repositories, and replace ``DATASOURCE`` and ``CSV_URL`` with their values, or save them as secrets as well.{% tip-box-end %}