Jun 27, 2025

Why we ditched Prometheus for autoscaling (and don't miss it)

Tinybird uses KEDA and its own real-time analytics platform to autoscale Kafka workloads. Learn how we made it work.
Scalable Analytics Architecture
Víctor M. Fernández, Site Reliability Engineer

Every Tinybird user can ingest thousands of events per second, with 10x traffic spikes during launches or news cycles. Our platform needs to scale accordingly. CPU-based scaling was too slow. Memory-based, too vague. We needed to scale based on real signals, not lagging resource metrics. Traditional autoscaling broke the moment our queues backed up faster than Prometheus metrics could react.

So we turned to Tinybird and built a custom autoscaling system using live ingestion metrics and Kubernetes Event-driven Autoscaling (KEDA). No scraping delays, no extra monitoring stack to run.

The Challenge: Unpredictable real-time workloads

Real-time analytics workloads are inherently unpredictable. Customer traffic can spike 10x during product launches, marketing campaigns, or breaking news events. The traditional autoscaling playbook fails because:

  • It's reactive, not predictive: CPU spikes after your system is already overwhelmed.
  • It measures the wrong thing: High CPU doesn't always mean you need more pods; sometimes you need smarter data routing.
  • It's painfully slow: When new pods finally spin up, user requests may have already been delayed.

The Kafka Bottleneck

Our Kafka service is critical: it processes terabytes of data every day, ingesting data from external Kafka clusters and feeding it into our ClickHouse infrastructure. During peak hours, we might see:

  • High-volume event streams from customer Kafka topics.
  • Sudden spikes in data volume during customer campaigns.
  • Varying message sizes and processing complexity.

We needed a solution that could scale based on the actual data processing demand, not just generic resource utilization.

Enter KEDA: Kubernetes Event-Driven Autoscaling

What Makes KEDA Different

KEDA extends the Horizontal Pod Autoscaler (HPA) to work with event-driven metrics:

  • Custom Metrics: Instead of generic CPU/Memory metrics, scale based on what actually matters - queue depth, message lag, API response times, or any custom business metric.
  • Multiple Scalers: Combine different triggers (CPU, custom metrics, external APIs).

Two Approaches: Traditional vs. Self-Reliance

We explored two different approaches for implementing KEDA autoscaling, each with distinct tradeoffs. Here's how both work and why we chose to use our own platform.

Traditional Approach: Prometheus + KEDA

The typical setup involves running Prometheus to collect and expose application metrics. Your app publishes metrics at a /metrics endpoint in Prometheus format, which Prometheus scrapes at regular intervals. KEDA then queries Prometheus to retrieve these metrics and make scaling decisions based on them.

KEDA Configuration

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
spec:
  # metadata and scaleTargetRef omitted for brevity; only the Prometheus trigger is shown
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus:9090
      metricName: lag
      threshold: '1000'
      query: avg(lag)

We gave Prometheus a fair shot, but moved on because:

  • Multi-hop delays: application → local Prometheus scrape → central Prometheus aggregation → federation to monitoring cluster → KEDA query → scaling decision. Each hop adds latency and potential failure points.
  • Query overhead: KEDA polling Prometheus adds another layer of latency.
  • Stale data during spikes: Metrics are most outdated when you need scaling most.

Self-Reliance: Tinybird + KEDA

Instead of managing a Prometheus stack, we plugged KEDA directly into Tinybird’s real-time metrics API. No scraping. No delays. Just fresh ingestion data powering scaling decisions in seconds.

Because Tinybird can expose Prometheus-compatible endpoints, KEDA can pull live metrics from the source. This means faster scaling, simpler infrastructure, and autoscaling based on the same streaming data we already trust for analytics.

Step 1: Defining the right metrics

We identified a key metric for intelligent autoscaling:

  • Kafka Lag: How far behind are our consumers?

Step 2: Tinybird-native metrics pipeline

Tinybird's native Prometheus endpoint support made it easy to expose this metric in the right format for KEDA.

Here's how we created our scaling metrics endpoint:

TOKEN "metric_lag" READ

NODE kafka_stats
SQL >
    %
    SELECT
        max(lag) as max_lag
    FROM kafka_ops_log
    where timestamp > now() - interval {{ Int32(seconds, 10) }} seconds
    {% if defined(user_id) and user_id != '' %}
        and user_id {{ String(operator, '=') }} {{ String(user_id) }}
    {% end %}

NODE kafka_metrics
SQL >
    SELECT
        arrayJoin(
            [
                map(
                    'name',
                    'max_lag',
                    'type',
                    'gauge',
                    'help',
                    'max ingestion lag',
                    'value',
                    toString(max_lag)
                )
            ]
        ) as metric
    FROM kafka_stats

NODE kafka_pre_prometheus
SQL >

    SELECT
        metric['name'] as name,
        metric['type'] as type,
        metric['help'] as help,
        toInt64(metric['value']) as value
    FROM kafka_metrics

This pipe returns data in Prometheus format when accessed via the .prometheus endpoint, e.g.:

curl -X GET \
  "${TINYBIRD_HOST}/v0/pipes/kafka_scaling_metrics.prometheus?seconds=30&user_id=user123&operator=%3D" \
  -H "Authorization: Bearer ${TINYBIRD_TOKEN}"

This approach allows us to compute scaling metrics in real time from the same data powering customer-facing analytics:

  • Zero scraping lag: Metrics computed fresh when KEDA requests them.
  • Always fresh: Every KEDA poll gets the latest data state.
  • No metric storage needed: Metrics computed from streaming data, not pre-aggregated.

Step 3: KEDA configuration with metrics-api scaler

Here's how we wired everything together, connecting KEDA directly to our Tinybird Prometheus endpoint using the metrics-api scaler:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: kafka-scaler
spec:
  scaleTargetRef:
    name: kafka-deployment
    kind: StatefulSet
  minReplicaCount: 2
  maxReplicaCount: 20
  pollingInterval: 30    # how often KEDA queries the endpoint, in seconds
  cooldownPeriod: 300
  triggers:
  - type: metrics-api
    metricType: AverageValue
    metadata:
      url: https://example.tinybird.co/v0/pipes/kafka_scaling_metrics.prometheus
      format: prometheus
      targetValue: '1000'        # target average max_lag per replica
      valueLocation: 'max_lag'   # metric name to read from the Prometheus-format response
      authMode: 'apiKey'
      method: 'query'            # send the API key as a query parameter...
      keyParamName: 'token'      # ...named "token"
    authenticationRef:
      name: kafka-keda-auth
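
Applying it is a normal kubectl workflow. A quick sanity check, assuming the manifest is saved as kafka-scaler.yaml:

kubectl apply -f kafka-scaler.yaml
kubectl get scaledobject kafka-scaler
kubectl get hpa    # KEDA creates and manages the underlying HorizontalPodAutoscaler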

Authentication Setup

For secure access to Tinybird endpoints, we set up proper authentication:

apiVersion: v1
kind: Secret
metadata:
  name: keda-kafka-token
data:
  token: <base64-encoded-tinybird-token>
---
apiVersion: keda.sh/v1alpha1
kind: TriggerAuthentication
metadata:
  name: kafka-keda-auth
spec:
  secretTargetRef:
  - parameter: apiKey
    name: keda-kafka-token
    key: token
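
Rather than base64-encoding the token by hand, the same Secret can be created directly; a minimal sketch, assuming the Tinybird token is exported as TINYBIRD_TOKEN:

kubectl create secret generic keda-kafka-token \
  --from-literal=token="$TINYBIRD_TOKEN"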

What broke with Prometheus

Running Prometheus at scale isn't just about the server; it's about the entire ecosystem:

Component            | Traditional Prometheus                     | Tinybird Approach
Metrics Storage      | Prometheus + persistent volumes            | Optimized ClickHouse
High Availability    | Multiple Prometheus replicas + federation  | Built-in HA
Data Retention       | Configure retention policies, manage disk  | Configure with SQL pipes
Operational Overhead | High: 3-4 services to manage               | Low: update SQL queries

How Tinybird fixed it

  • No metric infrastructure: No exporters, no Prometheus, no additional storage layers.
  • Metrics calculated on request, not scraped periodically: Metrics are computed fresh from live data every time KEDA polls.
  • Logic in SQL: Update scaling behavior by editing a query, not redeploying code.
  • Built-in HA: Tinybird handles availability.

How running it ourselves made the product better

Every autoscaling issue impacted us directly, just as it would our customers. This led to:

  • Faster fixes (because they affected us directly).
  • Clearer error messages (we had to debug them ourselves).
  • More reliable service (our uptime depended on it).

These discoveries directly improved our product for all customers.

The Simulator

To pressure-test our scaling setup and validate edge-case behavior, we built a metrics simulation tool, written in Golang.

It generates synthetic metrics, displays them in a terminal UI with real-time visualization and configurable patterns, and exposes an endpoint that serves them.

[Metrics Simulator demo: thresholds are intentionally low and scaling speed is increased to showcase the behavior quickly.]

What We Learned

  • Stabilization windows: a 10-minute scale-up window and a 30-minute scale-down window prevent thrashing.
  • Single metrics lie: CPU alone scales too late; combining lag + CPU gives better signal-to-noise ratio.
  • Thresholds are workload-specific: what works for batch processing fails for real-time streams.

1. Choosing bad metrics will kill your autoscaling

Not all metrics are equal for autoscaling:

  • Good metrics: Queue depth, processing lag, business KPIs.
  • Poor metrics: CPU utilization alone, memory usage without context.

2. Tune stabilization windows

Prevent scaling flapping with proper stabilization:

behavior:
  scaleUp:
    stabilizationWindowSeconds: 600    # 10 minutes
  scaleDown:
    stabilizationWindowSeconds: 1800   # 30 minutes
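
In a KEDA ScaledObject this behavior block nests under advanced.horizontalPodAutoscalerConfig; a sketch reusing the scaler defined earlier:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: kafka-scaler
spec:
  advanced:
    horizontalPodAutoscalerConfig:
      behavior:
        scaleUp:
          stabilizationWindowSeconds: 600    # 10 minutes
        scaleDown:
          stabilizationWindowSeconds: 1800   # 30 minutes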

3. Test with real traffic patterns

Our simulator helped us discover edge cases:

  • Gradual vs. sudden traffic spikes behave differently.
  • Weekend vs. weekday patterns require different thresholds.

4. Monitor everything

Use your own tools to monitor autoscaling:

  • Track scaling events and their triggers.
  • Measure time-to-scale and effectiveness.
  • Set alerts for scaling failures or delays.
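
KEDA surfaces its decisions through standard Kubernetes objects, so a few kubectl commands cover most of this; the ScaledObject name below matches the manifest from earlier:

kubectl describe scaledobject kafka-scaler    # trigger status and recent conditions
kubectl get hpa                               # current vs. desired replicas
kubectl get events --sort-by=.lastTimestamp | grep -i scal    # recent scaling activity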

Advanced patterns: Multi-trigger scaling

Combining multiple metrics

Our production configuration uses mixed triggers: metrics-api and traditional CPU scaling.

triggers:
- type: metrics-api
  metricType: AverageValue
  metadata:
    url: https://api.tinybird.co/v0/pipes/kafka_scaling_metrics.prometheus
    format: prometheus
    targetValue: '1000'
    valueLocation: 'max_lag'
    authMode: 'apiKey'
    method: 'query'
    keyParamName: 'token'
  authenticationRef:
    name: kafka-keda-auth
- type: cpu
  metricType: Utilization
  metadata:
    value: '70'

Regional scaling strategies

For our multi-region deployment, we create region-specific Tinybird endpoints:

# us-east-1 configuration
triggers:
- type: metrics-api
  metadata:
    url: https://api.us-east-1.tinybird.co/v0/pipes/kafka_scaling_metrics_us_east.prometheus
    targetValue: '2000'  # Higher threshold for region 1 (to accommodate higher baseline traffic and prevent unnecessary scaling)
    valueLocation: 'max_lag'

# eu-west-1 configuration
triggers:
- type: metrics-api
  metadata:
    url: https://api.eu-west-1.tinybird.co/v0/pipes/kafka_scaling_metrics_eu_west.prometheus
    targetValue: '500'   # Lower threshold for region 2 (to respond quickly in regions with less baseline traffic)
    valueLocation: 'max_lag'

Troubleshooting common issues

Scaling too aggressively

  • Problem: Constant scaling up/down.
  • Solution: Increase stabilization windows and adjust thresholds.

Metrics not available

  • Problem: KEDA can't reach Tinybird API endpoint.
  • Solution: Check authentication token, endpoint URL, and any network policy restrictions.
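
A quick way to rule out auth and connectivity problems is to hit the endpoint the same way KEDA does, from inside the cluster; a sketch reusing the URL and token parameter from the scaler config:

kubectl run tb-check --rm -it --image=curlimages/curl --restart=Never -- \
  curl -s "https://example.tinybird.co/v0/pipes/kafka_scaling_metrics.prometheus?token=${TINYBIRD_TOKEN}"
# note: ${TINYBIRD_TOKEN} expands in your local shell before the command is sent to the pod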

Conclusion: Scaling smarter

Combining KEDA with Tinybird gave us faster, simpler, and more reliable autoscaling, driven entirely by real-time data.

The combination of KEDA's event-driven scaling and Tinybird's real-time metrics pipeline created a feedback loop that actively improves our infrastructure's performance and cost-effectiveness.

  1. Custom metrics work better than CPU/memory for workload-specific scaling.
  2. Real-time data beats pre-aggregated metrics for scaling responsiveness.
  3. Dogfooding drives product improvement when your uptime depends on your platform.

Try it yourself

You don't need to replace your monitoring stack. Just expose one Tinybird endpoint, wire it into KEDA, and autoscale with real-time data. Start small. Move fast.

Resources and Links

  • KEDA Documentation
  • Open observability in Tinybird with Prometheus endpoints
  • Consume API endpoints in Prometheus format
  • Prometheus Metrics Best Practices