Feb 21, 2025

The perfect data ingestion API design

If you ask me, this is pretty much perfect.
Javier Santana
Co-founder


The perfect data ingestion API design... does not exist 🙂.

I used the title to catch your attention, but I do think I’ve designed something close to perfect. Check it out and tell me what you'd change.

Easy to use

You can send data from any programming language in a few lines of code, as in the sketch below.
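For example, here's a minimal sketch in Python using only the standard library. The endpoint URL, data source name, and token are placeholders for illustration, not the exact API:

```python
import json
import urllib.request

# Placeholders: the URL, data source name, and token are assumptions for
# illustration, not a specific product's actual API.
URL = "https://api.example.com/v0/events?name=page_views"
TOKEN = "YOUR_INGEST_TOKEN"

events = [
    {"timestamp": "2025-02-21T10:00:00Z", "path": "/pricing", "user_id": 42},
    {"timestamp": "2025-02-21T10:00:01Z", "path": "/docs", "user_id": 7},
]

# NDJSON: one JSON object per line.
payload = "\n".join(json.dumps(e) for e in events).encode("utf-8")

req = urllib.request.Request(
    URL,
    data=payload,
    headers={"Authorization": f"Bearer {TOKEN}"},
    method="POST",
)

with urllib.request.urlopen(req) as resp:
    print(resp.status, resp.read().decode())
```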

A format for the web

It accepts NDJSON and JSON. Maybe I'd add support for Parquet, but I think compressed NDJSON is good enough. 

Being web-compatible allows you to connect almost any kind of webhook. Or send it from a JavaScript snippet.
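As a sketch of the webhook case: whatever JSON body a webhook delivers can be forwarded to the ingestion endpoint as-is. The function, URL, and token below are hypothetical; in a browser, the same thing is a one-line fetch():

```python
import json
import urllib.request

def forward_webhook(body: dict) -> int:
    """Forward a webhook's JSON payload to the ingestion endpoint unchanged."""
    req = urllib.request.Request(
        "https://api.example.com/v0/events?name=webhooks",  # placeholder URL
        data=json.dumps(body).encode("utf-8"),
        headers={"Authorization": "Bearer YOUR_INGEST_TOKEN"},  # placeholder token
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status
```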

Schema >>> schemaless

When working with a lot of data, schemaless is a waste of money and resources, both in storage and processing. The API transforms event attributes into columns (stored with the right type in a columnar database), leading to 10x-100x improvements in both.
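Purely as an illustration of that mapping (the column types here are my guess, not what the API actually infers):

```python
# One incoming event...
event = {
    "timestamp": "2025-02-21T10:00:00Z",
    "path": "/pricing",
    "user_id": 42,
    "duration_ms": 12.5,
}

# ...ends up as typed columns in the columnar store, roughly like:
schema = {
    "timestamp": "DateTime",
    "path": "String",
    "user_id": "Int64",
    "duration_ms": "Float64",
}
```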

You can always save the raw data to process it later but, in general, it’s a bad idea.

ACK

The API sends you an ack when the data is received and safely stored. Once you get it, you can forget about the data: you know it will eventually be written to the database.

Failing gracefully

Things fail, and this is the most interesting part. If an insert fails, you want to know with 100% certainty. And if your app dies while you are pushing data, should you retry?

The API is idempotent. You can retry within a 5-hour window, and if the data was already inserted, it won't be inserted again, as long as you send the same data batch (the API uses a hash of the data to know whether it was already inserted).
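A retry loop only needs to keep the batch around until it gets the ack, and resend exactly the same bytes so the server-side hash can recognize a duplicate. A minimal sketch, with the same placeholder URL and token as above:

```python
import json
import time
import urllib.error
import urllib.request

URL = "https://api.example.com/v0/events?name=page_views"  # placeholder
TOKEN = "YOUR_INGEST_TOKEN"                                # placeholder

def send_batch_with_retries(events: list[dict], max_attempts: int = 5) -> None:
    # Serialize once so every attempt sends byte-for-byte the same batch;
    # the server can then hash the payload and recognize a retry of an
    # already-inserted batch.
    payload = "\n".join(json.dumps(e) for e in events).encode("utf-8")
    for attempt in range(max_attempts):
        req = urllib.request.Request(
            URL,
            data=payload,
            headers={"Authorization": f"Bearer {TOKEN}"},
            method="POST",
        )
        try:
            with urllib.request.urlopen(req, timeout=10):
                return  # 2xx ack: safe to forget about this batch
        except urllib.error.HTTPError as err:
            if err.code < 500:
                raise  # client error: retrying the same payload won't help
            # 5xx: server-side trouble, retry the identical batch
        except (urllib.error.URLError, OSError):
            pass  # network error or timeout: unclear if it landed, so retry
        time.sleep(2 ** attempt)  # exponential backoff between attempts
    raise RuntimeError("batch not acknowledged after retries")
```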

The first layer of the API is deliberately simple, so if something does fail internally, in almost every case the data is at least buffered.

Buffering

Speaking of buffering... the API does buffer data. This is generally good performance hygiene for an ingestion API, but it's also critical if you have an analytical database (as we do). These databases aren't built to accept streaming inserts; they need to insert data in batches, otherwise it's too expensive (in both CPU and S3 write operations).

This buffer layer also works as a safety net when things fail. For example, it's quite easy to overload a database; the buffer helps you mitigate that without even noticing.
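Conceptually, the buffer is just "collect events, flush in batches by size or age". Here's a client-side sketch of that pattern; it's an illustration of the idea, not the API's actual internals:

```python
import time
from typing import Callable

class Buffer:
    """Collect events one by one, flush them to a sink in batches."""

    def __init__(
        self,
        sink: Callable[[list[dict]], None],
        max_events: int = 10_000,
        max_age_s: float = 4.0,
    ):
        self.sink = sink
        self.max_events = max_events
        self.max_age_s = max_age_s
        self.events: list[dict] = []
        self.first_event_at: float | None = None

    def append(self, event: dict) -> None:
        if self.first_event_at is None:
            self.first_event_at = time.monotonic()
        self.events.append(event)
        if self._should_flush():
            self.flush()

    def _should_flush(self) -> bool:
        too_big = len(self.events) >= self.max_events
        too_old = (
            self.first_event_at is not None
            and time.monotonic() - self.first_event_at >= self.max_age_s
        )
        return too_big or too_old

    def flush(self) -> None:
        if not self.events:
            return
        batch, self.events, self.first_event_at = self.events, [], None
        # One batched insert instead of thousands of tiny ones.
        self.sink(batch)

# Usage: the sink here just prints; a real one would write to the database.
buf = Buffer(sink=lambda batch: print(f"flushing {len(batch)} events"))
for i in range(25_000):
    buf.append({"event_id": i})
buf.flush()  # flush whatever is left at shutdown
```

The database then sees a few large inserts instead of a flood of tiny ones, which is what keeps CPU and S3 write costs under control.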

Scale

You can throw 1,000 QPS at it with one event each, or 200 QPS with a 50 MB payload each. Even if you have a lot of data, that covers at least 99% of use cases.

Real time

Even with some buffering, it works in real time. It usually takes no more than 4 seconds for the data to be available to query from the database, but even that can be reduced to close to a second. 

And in general, it just works.

Try it

What do you think? Is it the perfect data ingestion API? Try it out and let me know.
