Materialized views in ClickHouse® only process new data, leaving historical data unprocessed. Backfilling this historical data is challenging at scale due to resource consumption, time constraints, and data consistency risks.
This post covers best practices for safe backfills including query optimization, settings tuning, and intelligent partitioning. It also explores the operational burden of DIY backfills (monitoring, incident response, partial data cleanup) versus Tinybird's managed approach (atomic backfills with auxiliary tables, automatic settings tuning, on-demand compute isolation, and intelligent retry logic). For large-scale production backfills, the engineering time saved and risk reduction typically justify managed solutions.
Understanding the Backfill Challenge
Backfilling a materialized view isn't difficult because the concept is complex; it's difficult because of the scale and the risks involved.
What's at stake here is cluster stability, data consistency, and production uptime. A poorly executed backfill can bring down your entire ClickHouse cluster, corrupt your data with partial inserts, and cost you engineering time.
Every backfill involves balancing three competing priorities: Speed, Safety, and Resource Impact.
You can optimize for any two, but rarely all three simultaneously. A fast backfill consumes more resources and increases risk. A safe backfill takes longer. A backfill that doesn't impact production needs dedicated infrastructure.
The key to successful backfills is finding the right balance of these three factors for your specific situation.
Real-World Example: From 8 Days to 4 Hours
Over the past year, we've encountered numerous optimization and parallelization challenges. One case in particular illustrates the impact of proper backfill optimization.
A customer needed to backfill a materialized view containing complex aggregations across several terabytes of historical data. The initial estimate: 8 days of continuous processing. This timeline was unacceptable. Running heavy queries for 8 days would significantly impact production performance. The longer runtime increased the probability of failures requiring a restart from scratch. And when launching new features, waiting over a week for historical data simply wasn't viable.
The first attempt used a straightforward sequential INSERT...SELECT strategy with some tweaked ClickHouse settings. Processing data month by month, each month took 6-8 hours to complete. The aggregations included heavy operations like quantile() for percentile calculations, consuming massive amounts of memory and CPU.
By tuning settings like block size, read and insert threads, and running multiple queries in parallel using several on-demand replicas for the backfill operation, we reduced the total time to just 4 hours—a 48x improvement.
The result: zero impact on production resources, guaranteed data consistency across the chain of materializations, and a fully automated process that required no manual intervention or monitoring.
Why Backfills Are Hard
Resource Consumption
Backfilling, with or without aggregations, needs to read all of the historical data from your disk or cloud storage, and heavy aggregations like quantiles, uniqExact, or groupArray can consume hundreds of gigabytes of memory per query. CPU usage spikes across all cores, disk I/O can saturate, and network bandwidth takes a hit. All of these resources are shared with your production queries, and a heavy backfill can take your cluster down.
Time Constraints
An unoptimized backfill can take days longer than expected. Your new use case or analytics can't wait that long for historical data. And the longer the backfill runs, the higher the probability of failure.
Data Consistency
This is a silent killer. When a backfill query fails midway through execution (and it probably will), ClickHouse may have already written some data to your materialized view table. You will have to deal with incomplete data that's nearly impossible to detect without data consistency checks. Your metrics are wrong, but you don't know it. Your clients or business make decisions based on bad data.
General Recommendations for Safe Backfills
Before you start any backfill, you need to set yourself up for success. These recommendations apply whether you're doing it yourself or using a managed solution, though Tinybird abstracts some of the burden away so you can focus on feature development.
Recommendation #1: Optimize your query first
Don't start backfilling until you've optimized the query itself. A 10x query improvement means 10x faster backfills, and 10x less resource consumption. Whether the improvement is on memory or CPU usage, the time you invest in optimization pays dividends across every single chunk of your backfill.
Depending on your query, there might be different ways to optimize it. Pay attention to the operations you're doing in the query, and avoid multiple passes over the data. For example, if you're calculating multiple quantiles, do it in a single pass with the quantiles() function instead of calling quantile() N times. Where you can, use cheaper functions to extract attributes from JSON, such as simpleJSONExtractString instead of JSONExtractString.
Other optimizations come with trade-offs that may or may not be acceptable for your use case, like using uniq() instead of uniqExact() when approximate counts are acceptable, or quantileTDigest() instead of quantileExact() where approximation is fine.
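As a quick illustration, the single-pass rewrite looks like this (table and column names are illustrative):
-- Three separate quantile states: one per quantile() call
SELECT
    quantile(0.5)(latency_ms) AS p50,
    quantile(0.9)(latency_ms) AS p90,
    quantile(0.99)(latency_ms) AS p99
FROM events;
-- One shared state: quantiles() computes all levels at once and returns an array
SELECT quantiles(0.5, 0.9, 0.99)(latency_ms) AS p50_p90_p99
FROM events;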
Recommendation #2: Configure Settings for Your Query Type
ClickHouse's default settings are optimized for fast OLAP queries, not long-running backfills. You need to tune settings based on your specific query pattern and cluster resources.
There are several areas to optimize:
Memory Management
SET max_memory_usage = 50000000000; -- 50GB per query
SET max_bytes_before_external_group_by = 40000000000; -- 40GB before spilling to disk
SET max_bytes_before_external_sort = 40000000000; -- 40GB before external sort
For heavy aggregations, set max_memory_usage based on your available RAM minus overhead for the OS and other processes. If your aggregations are memory-intensive (like uniqExact or groupArray), configure the external aggregation thresholds (max_bytes_before_external_group_by, max_bytes_before_external_sort) to spill to disk before hitting memory limits. This prevents OOM kills but trades speed for safety.
Thread Configuration
SET max_threads = X; -- Limit parallelism within a single query
SET max_insert_threads = X; -- Threads for writing data
Counter-intuitively, fewer threads during backfills can be better. More threads mean more memory per thread and more CPU contention. Start conservative and increase only if CPU usage is low. If running multiple backfill queries in parallel, use fewer threads per query to avoid oversubscription.
Block Size Settings
SET min_insert_block_size_rows = X; -- Min rows before writing to disk
SET min_insert_block_size_bytes = X; -- Min bytes before writing
Larger insert blocks mean fewer parts created on disk, which reduces merge overhead. For backfills, increase min_insert_block_size_rows and min_insert_block_size_bytes to create fewer, larger parts. However, larger blocks also mean more memory usage during insert. Balance based on your available memory.
Timeout Settings
SET max_execution_time = X; -- Max seconds per query (e.g., 7200 for 2 hours)
SET send_timeout = 600; -- 10 minutes (only for remote tables)
SET receive_timeout = 600; -- 10 minutes (only for remote tables)
Backfills take time. Increase max_execution_time based on your chunk size estimates. Better to let a query run longer than to fail and retry with partial data. Adjust network timeouts if working with distributed tables or remote clusters.
Insert-Specific Settings
SET insert_deduplicate = 0; -- Skip deduplication checks
SET async_insert = 0; -- Synchronous inserts for backfills
For backfills where you control uniqueness, disable insert_deduplicate to skip expensive duplicate checks. Always use async_insert = 0 for backfills to ensure you know exactly when each chunk completes.
If your aggregation's GROUP BY matches your table's ORDER BY, enable optimize_aggregation_in_order for massive memory savings.
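In settings form, that's simply (the setting is off by default):
SET optimize_aggregation_in_order = 1; -- Aggregate following the table's sort order, flushing state incrementally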
The key is testing these settings on a small chunk first, monitoring memory and CPU usage, then adjusting before scaling to your full backfill.
Recommendation #3: Split or Parallelize Intelligently
A single massive INSERT...SELECT statement is a recipe for disaster. It's slow, risky, and impossible to resume if it fails. The solution is to split your backfill into manageable chunks and run them in parallel where possible.
But splitting isn't as simple as it sounds: you need to split intelligently based on your data's natural boundaries, and you need guardrails to prevent resource exhaustion.
Strategy A: Split by Partition Key (Time-based)
Most ClickHouse tables are partitioned by date or time. This is your natural splitting point because:
- Partitions are physically separate on disk
- Each chunk reads from independent files
- No risk of lock contention
- Easy to track progress
-- For a table partitioned by month
-- Chunk 1: January 2024
INSERT INTO events_mv
SELECT * FROM events
WHERE toYYYYMM(timestamp) = 202401;
-- Chunk 2: February 2024
INSERT INTO events_mv
SELECT * FROM events
WHERE toYYYYMM(timestamp) = 202402;
Strategy B: Split by Categorical Column
When your table doesn't have a time dimension, or when time-based chunks are still too large, split by a categorical column, preferably the first column of your sorting key (ORDER BY clause).
-- Split by user segment
INSERT INTO user_aggregates_mv
SELECT * FROM events
WHERE user_segment = 'enterprise';
INSERT INTO user_aggregates_mv
SELECT * FROM events
WHERE user_segment = 'smb';
You can combine both strategies for even smaller chunks if needed, as shown below.
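For instance, combining the month filter with the segment filter from above:
-- January 2024, enterprise segment only
INSERT INTO user_aggregates_mv
SELECT * FROM events
WHERE toYYYYMM(timestamp) = 202401
  AND user_segment = 'enterprise';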
The DIY Way: Risks and Pitfalls
Once you have optimized your query and assessed its performance, it's time to run it against your database. It's important to understand what you're signing up for, and the risks you'll need to manage.
The reality is that most teams underestimate the operational burden. What seems like "just run some INSERT queries" turns into:
- 24/7 monitoring during the backfill period
- Incident response when things go wrong
- Coordination with other engineering teams
- Extended timelines when you need to pause for production events
Risk #1: Resource Exhaustion and Query Failures
Even with careful tuning, backfill queries can consume enormous resources. Memory usage is particularly unpredictable because it depends on data distribution, which you may not fully understand until queries start running.
OOM (Out of Memory) Kills
Code: 241. DB::Exception: Memory limit (for query) exceeded:
would use 51.37 GiB (attempt to allocate chunk of 4194304 bytes),
maximum: 50.00 GiB
Your query exceeded max_memory_usage, or the Linux OOM killer terminated the process. Either way, you're left with partial data, wasted time, and potentially an unstable cluster.
The challenge: you set max_memory_usage = 50GB based on testing, but production data in July had 10x more unique users than your test data in January. Your uniq() aggregation consumed far more memory than expected.
Query Timeouts
Code: 159. DB::Exception: Timeout exceeded: elapsed 3601.234 seconds,
maximum: 3600.000
Your query took longer than max_execution_time. Maybe you need to split into smaller chunks. But which dimension should you split on? Time? Category? Hash?
CPU Saturation
Heavy aggregations can peg all CPUs at 100% for minutes or hours. Your production queries start queueing. Dashboard load times go from 2 seconds to 30 seconds. Users complain. Your monitoring alerts fire. Someone gets paged.
And the backfill? Still running. You're 40% through and can't stop now.
Risk #2: Partial Data and Consistency Issues
You've split your backfill into 100 partitions. After two days, 97 have succeeded and 3 have failed. Sounds great, right? But here's the problem: how do you know which ones failed? How do you verify each partition actually wrote the correct number of rows? What if one succeeded but only wrote 80% of the data before timing out?
ClickHouse inserts are not atomic by default across multiple blocks. A failed query may have written 15 out of 20 blocks. There's no built-in rollback mechanism. You're responsible for:
- Detecting these partial blocks
- Determining exactly what was written
- Removing only the partial data (without removing good data)
- Retrying the chunk
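A minimal consistency check, assuming the events/events_mv example from earlier, a 1:1 (non-aggregating) materialization, and a target table partitioned by month:
-- Rows the source produces for the January 2024 chunk
SELECT count() FROM events WHERE toYYYYMM(timestamp) = 202401;
-- Rows actually written to the target for the same chunk
SELECT count() FROM events_mv WHERE toYYYYMM(timestamp) = 202401;
-- If the counts don't match, drop the partition and retry the whole chunk
ALTER TABLE events_mv DROP PARTITION 202401;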
Risk #3: Impact on Production Infrastructure
Your backfill queries compete with production workloads for the same resources: CPU, memory, disk I/O, network bandwidth, and replication capacity. Even a "carefully tuned" backfill affects production.
Memory Pressure
Your 4 parallel backfill queries are using 160GB total (40GB each). Your production cluster has 256GB RAM per node. That leaves 96GB for production queries, OS cache, and background processes, where you'd normally have the full 256GB. Production queries that used to fit comfortably in memory now spill to disk or hit limits.
Disk I/O Contention
Backfills read massive amounts of data from disk. If your production queries also need to read from disk (cache misses), they're now competing for the same SSD read IOPS. Latency increases. Queries that took 500ms now take 2 seconds.
Replication Lag
Your heavy INSERT load creates a replication backlog. Replicas fall behind by minutes or hours. During this time:
- Replicas serve stale data to queries
- If the primary fails, you lose data
- When backfill completes, replicas spend hours catching up
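You can watch for this during a backfill with a quick check against the system tables; for example:
-- Replicas more than five minutes behind
SELECT database, table, absolute_delay
FROM system.replicas
WHERE absolute_delay > 300;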
Merge Storms
Each INSERT creates new parts on disk. ClickHouse's background merge process combines small parts into larger ones, but backfills can create parts faster than merges can keep up. Part count grows, queries slow down, and inserts can eventually fail with "Too many parts" errors.
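A simple way to keep an eye on it, assuming the events_mv table from earlier:
-- Active parts per partition; sustained growth means merges aren't keeping up
SELECT partition, count() AS part_count
FROM system.parts
WHERE database = currentDatabase() AND table = 'events_mv' AND active
GROUP BY partition
ORDER BY part_count DESC
LIMIT 10;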
The Tinybird Way: Managed Backfills
At Tinybird, we've learned how to do backfills safely at massive scale: we've encountered failures, debugged performance issues, and applied what we learned. We've automated the entire process so you don't have to think about any of this.
Feature #1: Auxiliary Table Pattern for Atomic Backfills
The core problem with partial inserts is that failed queries leave inconsistent data in your production materialized view. Our solution ensures backfills are all-or-nothing for each partition using an auxiliary table pattern.
For each backfill partition, we create a temporary staging table, insert all the chunks into it, and move to the production table only if the data is complete.
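Conceptually, the flow looks something like this simplified sketch (illustrative table names and a month-partitioned target, not Tinybird's exact implementation):
-- 1. Stage the partition in a table with the same structure as the target
CREATE TABLE events_mv_staging AS events_mv;
INSERT INTO events_mv_staging
SELECT * FROM events
WHERE toYYYYMM(timestamp) = 202401;
-- 2. Only after the insert completes and validation passes, swap the partition into production
ALTER TABLE events_mv REPLACE PARTITION 202401 FROM events_mv_staging;
-- 3. Clean up the staging table
DROP TABLE events_mv_staging;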
This way, your production tables always have complete and consistent data without you having to worry about it. For you as a customer, this complexity is invisible. We handle the entire auxiliary table flow automatically. If any partition fails, we automatically retry it without affecting successful chunks. Your production data is not at risk.
Feature #2: Intelligent Settings Tuning
ClickHouse has a lot of settings that affect query performance and resource usage. Getting them right requires deep expertise and depends on multiple factors:
- Your query pattern (simple selects vs. heavy aggregations)
- Your data characteristics (cardinality, distribution, skew)
- Your cluster configuration (memory, CPU, disk type)
We've built an engine that analyzes all of these factors and automatically determines optimal settings for your specific backfill.
Feature #3: On-Demand Backfill Replicas
Even with perfect query optimization and settings, backfills consume significant resources. Running heavy aggregations for hours or days affects production query performance. Our solution: spawn temporary, dedicated replicas just for backfills.
In Forward, On-Demand replicas for deployment backfills are automatically launched when your tables exceed 100M rows or 50 GB in size.
If you are running a materialized view populate with the Tinybird CLI in Classic, you can add the --on-demand-compute flag to your tb pipe populate command.
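For example (the pipe name is illustrative):
tb pipe populate my_pipe --on-demand-compute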
With a dedicated replica:
- Backfill queries use 100% of replica resources (no competition)
- Zero impact on production query latency
- No memory pressure on production nodes
- Can run more aggressive parallelization
- Faster backfill completion
Feature #4: Automatic Partition Splitting
Tinybird automatically handles partition splitting: when the table is partitioned by date, it queries system.parts to discover your table's partition structure and splits the backfill accordingly.
The system adapts to your actual data structure and handles the entire splitting process. If your table has 200 monthly partitions, we will insert data with 200 INSERT queries.
This automatic discovery also ensures we never miss data. We're not guessing at date ranges or partition values; we're reading directly from ClickHouse's metadata to know exactly what exists.
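If you're doing the splitting yourself, the same metadata is available to you; a minimal sketch, assuming the events table from earlier:
-- List the active partitions of the source table
SELECT partition
FROM system.parts
WHERE database = currentDatabase() AND table = 'events' AND active
GROUP BY partition
ORDER BY partition;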
Feature #5: Intelligent Retry and Error Handling
Backfills fail. Queries timeout, nodes run out of memory, networks hiccup. The question isn't whether failures will happen, it's how your system handles them.
Whether the query failed due to a transient problem or due to overload caused by the chosen settings and unexpected events, Tinybird automatically retries the partition insert. If the query consumed too many resources, it retries with lower setting values; if the failure came from elsewhere, it retries with the same settings.
Does it make sense to DIY?
Backfilling materialized views in ClickHouse is a solved problem, but solving it yourself requires significant engineering effort and ClickHouse expertise.
For one-time small backfills (under 100GB, simple queries, measured in minutes), the DIY approach with careful attention to the recommendations in this post will serve you well. Use the chunking strategies, settings guidance, and validation techniques we've covered.
For large-scale, production-critical backfills (especially if you're doing them regularly), the operational burden and risk typically make managed backfills worth the investment. The time savings, risk reduction, and engineering focus you gain usually justify the cost many times over.
Ready to Backfill Your Materialized Views?
If you're planning a large-scale backfill and want to avoid the operational headaches, check out our documentation or try Tinybird for free.
