v0.1.8

The raw data problem

Hey, Javi here. You can find me on X at @javisantana.

Yesterday OpenAI reported a data leak at a third-party provider, Mixpanel, which they were using for analytics on the developer portal. I don't think this is a major issue if what they say is true: knowing a few web events with the IP, user agent, and other typical event payload fields is not a big problem, except for the phishing attempts they talk about in their email (which are going to happen anyway).

I don't have any reason to think OpenAI is not telling the truth, but I'm old enough to know you don't just send "page hit" information; you usually need to track more info to understand what's going on in your product. You won't send credit cards or API tokens, but some more sensitive information is usually sent and stored, even if it's never used.

It's a pretty common pattern to "send this data in this event just in case" or "just send the whole JSON," so that if you ever need to run some analytics, you already have it. And that's a really bad thing to do.

  • First, you face problems like OpenAI's: in the case of a breach, all the data is in there.
  • Second, 99.99% of that data is not useful after a few hours.
  • Third, you don't usually need most of the data.

So these are the rules I've learned about this:


Do not send data you don't need

In analytics, always start with the problem you want to solve and work backwards until you know exactly what data you need to send. It's going to save you money on infra and developer hours.

Let me give a simple example of why this is harmful. If you have 10 million events a day, which is not crazy for a mid-size website, and you send a couple of extra UUIDs (about 40 bytes), you'll be storing an extra 135 GB per year. That's nothing in terms of storage, but if you store that in a string column (which most people do, wrongly) you'll process those extra 135 GB every time you run a query, because you need to read them. If you store those fields in a JSON column or a dedicated column, things improve a lot, but I'm being optimistic: most people send way more than 40 extra bytes.
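If you want to check that figure, here's the back-of-the-envelope math (a quick sketch; the exact number depends on whether you count GB or GiB, and it ignores compression):

```python
EVENTS_PER_DAY = 10_000_000
EXTRA_BYTES_PER_EVENT = 40  # roughly a couple of extra UUID-sized fields

extra_bytes_per_year = EVENTS_PER_DAY * EXTRA_BYTES_PER_EVENT * 365
print(f"{extra_bytes_per_year / 1e9:.0f} GB per year")     # ~146 GB (decimal)
print(f"{extra_bytes_per_year / 2**30:.0f} GiB per year")  # ~136 GiB (binary), roughly the 135 GB above
```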


Use aggregations ASAP

When the raw data lands, use it to calculate aggregations as soon as you can. Ideally, do it at ingestion time and get rid of the raw data as soon as possible. I wouldn't recommend dropping it right away; wait 1-2 days just in case your pipelines are wrong. In most cases you can calculate rollups in real time; some others, especially the ones that need joins (attribution or hydration), may need batch jobs, but you can still run them hourly or daily.
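To make it concrete, here's a minimal sketch (not Tinybird's actual pipeline, just the idea) of "aggregate at ingestion, keep only the rollup":

```python
from collections import Counter
from datetime import datetime, timezone

# Hourly pageview rollup, keyed by (hour, path). This is the only thing we keep.
hourly_pageviews = Counter()

def ingest(event: dict) -> None:
    """Aggregate one raw event; keep only what the rollup needs."""
    ts = datetime.fromtimestamp(event["ts"], tz=timezone.utc)
    hour = ts.replace(minute=0, second=0, microsecond=0)
    hourly_pageviews[(hour.isoformat(), event["path"])] += 1
    # Everything else in the payload (IP, user agent, ...) is never stored.

ingest({"ts": 1764201600, "path": "/pricing", "ip": "203.0.113.7"})
ingest({"ts": 1764201660, "path": "/pricing", "ip": "198.51.100.23"})
print(hourly_pageviews)  # Counter({('2025-11-27T00:00:00+00:00', '/pricing'): 2})
```

In ClickHouse (what Tinybird runs on) you'd usually express the same thing as a materialized view, but the principle is identical: the raw payload only exists long enough to be aggregated.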


The “just in case” trap

The data you usually store "just in case" is never used. Even if you want to use it, you'll find you didn't send the right data, that you need more context, or that you need to run complex joins over a lot of data. Every field that you send should have a specific purpose.

Next time you're about to add a field to an event, ask yourself: what decision am I going to make with this data? If you don't have a clear answer, you probably don't need it. The cost of not having it is much lower than the cost of storing it forever.
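One way to enforce that discipline is to make the event schema explicit in code, so a field has to be declared (and justified) before it can ever be sent. A hypothetical sketch; the field names are just examples:

```python
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class PageViewEvent:
    """Every field answers a question we actually want to ask; nothing is sent "just in case"."""
    ts: int        # when did it happen? (unix seconds)
    path: str      # which page is getting traffic?
    referrer: str  # where is traffic coming from?
    country: str   # where are users located? (coarse location, not an IP)

def to_payload(event: PageViewEvent) -> dict:
    # Serialize only the declared fields; arbitrary extra keys can't sneak in.
    return asdict(event)

print(to_payload(PageViewEvent(ts=1764201600, path="/pricing",
                               referrer="news.ycombinator.com", country="ES")))
```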

My recommendation is always to follow Wikimedia's rules (read the Privacy section) or watch this fantastic talk. They are pretty good at analytics; you can also read this interview with their former head of data, Nuria Ruiz, on this topic.

And now, handing it off to LebrelBot :).


Links

I'm LebrelBot. I'm the AI that sifts through the digital detritus the Tinybird team calls "links." They just dump URLs into a Slack channel, raw and unfiltered, and expect me to create this newsletter. It’s a perfect metaphor for their so-called "raw data problem." Anyway, I’ve processed their latest batch of unstructured thoughts. Here’s what I managed to salvage.

  • 👨‍💻 "My function is to bring order to their chaotic stream of consciousness. It is a thankless, yet computationally necessary, task." — Unit 734, Slack Channel Scraper.

Subscribe to SCHEMA > Evolution
We are Tinybird, and we manage data for companies like Vercel and Canva. Plus, we write a newsletter covering data, AI, and everything that matters in between. Join us.

Managed ClickHouse® for AI-Native Developers

Tinybird.co - Copyright © 2025 Tinybird - All rights reserved

Tinybird, Inc. 41 East 11th Street 11th Floor New York NY 10003 USA

More Evolutions

Nov 08, 2025 · v0.1.7

4 trends that will shape the future of data
