Name: Tinybird
Brand: Tinybird
Rating: 5.0 (10 reviews)

When you're processing millions of URLs in ClickHouse, the difference between getting example.com and an empty string can break your entire analytics pipeline. The domainRFC() function extracts RFC 3986-compliant hostnames from URLs, handling edge cases that the simpler domain() function can't parse correctly.

This guide covers the syntax differences between ClickHouse's hostname extraction functions, performance optimization strategies for large datasets, and how to build production-ready APIs that process URL data reliably.

What makes a hostname RFC-compliant in ClickHouse

In ClickHouse, the domainRFC function extracts hostnames from URLs following the RFC 3986 standard. This function handles complex URL structures that the simpler domain function can't parse correctly.

RFC 3986 defines how URLs work (everything from the protocol (https://) to the hostname (example.com) to the path (/api/data)). When you're dealing with URLs that contain user credentials, ports, or unusual formatting, RFC compliance becomes important for getting consistent results.

Here's what I mean: if you have a URL like https://user:password@api.example.com:8080/data, the RFC-compliant approach correctly identifies api.example.com as the hostname while ignoring the authentication details and port number.

Start building with Tinybird!

If you've read this far, you might want to use Tinybird as your analytics backend. You can just get started, on the free plan.

The domainRFC function versus domain

ClickHouse gives you two options for hostname extraction, but they behave very differently. The domain() function prioritizes speed over accuracy, while domainRFC() follows web standards even when URLs get messy.

Let me show you exactly what happens with a tricky URL:

SELECT
    domain('http://user:password@example.com:8080/path') AS domain_result,
    domainRFC('http://user:password@example.com:8080/path') AS rfc_result;

domain_result	rfc_result
(empty string)	example.com

The domain() function gives up when it sees user credentials in the URL. Meanwhile, domainRFC() correctly extracts the hostname you actually want. This difference matters when you're processing real-world data where URLs come from browsers, APIs, and log files that don't always follow perfect formatting.

Syntax and minimal query to extract hostnames

The domainRFC() function takes one parameter—the URL string—and returns the hostname as a string.

1. Run a one-line select

SELECT domainRFC('https://clickhouse.com/docs/sql-reference/') AS hostname;

This returns clickhouse.com. The function automatically strips away the protocol, path, query parameters, and everything else that isn't the hostname.

2. Cast result and check nullability

The function returns a String type, but you'll get an empty string when the URL is malformed. You can check for valid results like this:

SELECT 
    url,
    domainRFC(url) AS hostname,
    hostname != '' AS is_valid
FROM my_table;

When ClickHouse can't find a valid hostname in your URL, domainRFC() returns an empty string rather than throwing an error that would break your query.

Edge cases you handle in production

Real URL data gets messy fast. Web logs contain everything from IPv6 addresses to internationalized domains to completely broken URLs that somehow still need processing.

IPv6 literals

URLs can contain IPv6 addresses wrapped in square brackets:

SELECT domainRFC('http://[2001:db8::1]:8080/path') AS ipv6_host;

The function returns [2001:db8::1], keeping the bracket notation that distinguishes IPv6 addresses from regular hostnames.

Internationalized domain names

Unicode domain names get processed according to RFC standards:

SELECT domainRFC('https://例え.テスト/path') AS unicode_domain;

The function handles internationalized domain names (IDNs) by returning the Unicode representation as it appears in the original URL.

URLs with ports

Port numbers get stripped automatically:

SELECT domainRFC('https://api.example.com:443/v1/data') AS hostname_no_port;

This returns api.example.com, removing the :443 port specification that's redundant for HTTPS URLs anyway.

Missing schemes: URLs without http:// or https:// may return empty strings
Invalid characters: Special characters get handled according to RFC encoding rules
Empty inputs: Null or empty strings return empty results without breaking your query

Performance tactics for large tables

When you're processing millions of URLs, hostname extraction can slow down your queries significantly. However, a few optimization strategies can make a significant difference in performance.

1. Create a materialized column

Pre-compute hostnames when data arrives instead of parsing URLs repeatedly:

ALTER TABLE url_logs 
ADD COLUMN hostname String MATERIALIZED domainRFC(url);

This approach calculates the hostname once during insertion and stores it permanently. Queries against the hostname column run much faster than calling domainRFC() every time you run analytics.

2. Use projections for rollups

Create projections that pre-aggregate data by hostname:

ALTER TABLE url_logs 
ADD PROJECTION hostname_stats (
    SELECT hostname, count(), sum(bytes)
    GROUP BY hostname
);

Projections work like materialized views that ClickHouse maintains automatically. When you query traffic patterns by domain, ClickHouse uses the pre-computed aggregations instead of scanning the entire table.

3. Apply low-cardinality dictionaries

When your data contains many repeated hostnames, use LowCardinality encoding:

CREATE TABLE url_logs (
    timestamp DateTime,
    url String,
    hostname LowCardinality(String) MATERIALIZED domainRFC(url)
) ENGINE = MergeTree()
ORDER BY timestamp;

This optimization works best when you have thousands of URLs but only hundreds of unique hostnames—a common pattern in web analytics where users visit many pages on the same domains.

4. Avoid repeated parsing in joins

Extract hostnames once and reuse the result throughout complex queries:

WITH extracted AS (
    SELECT url, domainRFC(url) AS hostname
    FROM url_logs
    WHERE timestamp >= yesterday()
)
SELECT 
    e.hostname,
    count(*) AS requests,
    d.category
FROM extracted e
JOIN domain_categories d ON e.hostname = d.hostname
GROUP BY e.hostname, d.category;

This pattern prevents calling domainRFC() multiple times for the same URLs within a single query, which can add up to significant time savings on large datasets.

Building ClickHouse-based web and URL analytics APIs with Tinybird

Let's create a complete example that demonstrates how to build a real-time API for hostname extraction using Tinybird's managed ClickHouse platform.

This walkthrough shows how to ingest web analytics data, process URLs with ClickHouse functions, and expose hostname analytics through a real-time API endpoint.

1. Create the data source

First, create a data source to store web analytics events:

SCHEMA >
    `timestamp` DateTime64(3) `json:$.timestamp`,
    `url` String `json:$.url`,
    `user_agent` String `json:$.user_agent`,
    `status_code` UInt16 `json:$.status_code`,
    `user_id` String `json:$.user_id`

ENGINE MergeTree
ENGINE_SORTING_KEY timestamp
ENGINE_PARTITION_KEY toYYYYMM(timestamp)

Now build the project locally:

tb dev

This builds the project on Tinybird Local and creates your data source. Then ingest some sample data using the Events API:

curl -X POST "http://localhost:7181/v0/events?name=web_analytics" \
  -H "Content-Type: application/json" \
  -d '{"timestamp": "2025-10-03 14:30:15.123", "url": "https://api.example.com:8080/v1/data?param=value", "user_agent": "Mozilla/5.0", "status_code": 200, "user_id": "user123"}
{"timestamp": "2025-10-03 14:30:16.456", "url": "https://www.shop.example.com/products?category=electronics", "user_agent": "Chrome/91.0", "status_code": 200, "user_id": "user456"}
{"timestamp": "2025-10-03 14:30:17.789", "url": "https://user:pass@secure.example.com:8443/admin", "user_agent": "Firefox/89.0", "status_code": 401, "user_id": "user789"}'

2. Create the pipe files

Create a pipe that processes URL data and provides analytics:

NODE web_events
SQL >
    SELECT
        timestamp,
        url,
        user_agent,
        status_code,
        user_id,
        domainRFC(url) AS hostname,
        path(url) AS url_path,
        protocol(url) AS scheme,
        queryString(url) AS query_params
    FROM web_analytics
    WHERE timestamp >= now64(3) - INTERVAL 1 HOUR

NODE hostname_analytics
SQL >
    %
    SELECT
        hostname,
        scheme,
        COUNT(*) AS request_count,
        COUNT(DISTINCT user_id) AS unique_users,
        COUNTIf(status_code >= 400) AS error_count,
        MIN(timestamp) AS first_seen,
        MAX(timestamp) AS last_seen
    FROM web_events
    WHERE
        {\% if defined(hostname_filter) %}
            hostname LIKE {{ String(hostname_filter) }}
        {\% end %}
        {\% if defined(hours_back) %}
            AND timestamp >= now64(3) - INTERVAL {{ Int32(hours_back) }} HOUR
        {\% end %}
    GROUP BY hostname, scheme
    ORDER BY request_count DESC
    LIMIT {{ Int32(limit, 100) }}

TYPE endpoint

3. Deploy and test the API

Deploy your project to Tinybird Cloud:

tb --cloud deploy

Then test your API endpoint:

# Get all hostname analytics from the last hour
curl -H "Authorization: Bearer <your_token>" \
  "https://api.tinybird.co/v0/pipes/hostname_analytics.json"

# Filter by hostname pattern
curl -H "Authorization: Bearer <your_token>" \
  "https://api.tinybird.co/v0/pipes/hostname_analytics.json?hostname_filter=%example.com"

# Filter by time range
curl -H "Authorization: Bearer <your_token>" \
  "https://api.tinybird.co/v0/pipes/hostname_analytics.json?hours_back=2"

4. Sample API response

The API returns structured data with hostname analytics:

{
  "meta": [
    {
      "name": "hostname",
      "type": "String"
    },
    {
      "name": "scheme",
      "type": "String"
    },
    {
      "name": "request_count",
      "type": "UInt64"
    },
    {
      "name": "unique_users",
      "type": "UInt64"
    },
    {
      "name": "error_count",
      "type": "UInt64"
    },
    {
      "name": "first_seen",
      "type": "DateTime64(3)"
    },
    {
      "name": "last_seen",
      "type": "DateTime64(3)"
    }
  ],
  "data": [
    {
      "hostname": "api.example.com",
      "scheme": "https",
      "request_count": 1247,
      "unique_users": 892,
      "error_count": 23,
      "first_seen": "2025-10-03 14:30:15.123",
      "last_seen": "2025-10-03 15:45:22.789"
    },
    {
      "hostname": "shop.example.com",
      "scheme": "https",
      "request_count": 456,
      "unique_users": 234,
      "error_count": 5,
      "first_seen": "2025-10-03 14:30:16.456",
      "last_seen": "2025-10-03 15:30:18.345"
    }
  ],
  "rows": 2,
  "statistics": {
    "elapsed": 0.045,
    "rows_read": 2000,
    "bytes_read": 89000
  }
}

Common mistakes and how to debug them

Hostname extraction looks simple until you encounter real-world data complexity. Understanding common pitfalls helps you build more robust URL processing pipelines.

Choosing domain instead of domainRFC

The domain() function fails silently with many URL formats, returning empty strings where you'd expect valid hostnames:

SELECT 
    url,
    domain(url) AS domain_result,
    domainRFC(url) AS rfc_result
FROM (
    SELECT 'ftp://user:pass@files.example.com/data' AS url
    UNION ALL
    SELECT 'https://api.example.com:8443/v1'
) 
WHERE domain_result != rfc_result;

Always use domainRFC() unless you've specifically tested domain() against your complete dataset and confirmed it handles all your URL patterns correctly.

Ignoring null results on malformed URLs

Malformed URLs produce empty strings that can break downstream processing:

SELECT 
    url,
    domainRFC(url) AS hostname,
    CASE 
        WHEN hostname = '' THEN 'invalid_url'
        ELSE hostname 
    END AS processed_hostname
FROM url_table
WHERE url IS NOT NULL;

This pattern handles invalid URLs explicitly rather than letting empty strings propagate through your analysis and create confusing results.

Parsing inside a subquery without limit

Nested hostname extraction without proper limits can cause performance issues:

-- Avoid this pattern
SELECT hostname, count(*)
FROM (
    SELECT domainRFC(url) AS hostname
    FROM large_url_table  -- millions of rows
    WHERE hostname IN (
        SELECT domainRFC(competitor_url) 
        FROM competitor_urls  -- parsed repeatedly
    )
)
GROUP BY hostname;

Instead, extract hostnames once and join on the results, or use materialized columns to avoid repeated parsing overhead.

Related URL parsing functions to know

ClickHouse provides a complete suite of URL manipulation functions that work alongside domainRFC() for comprehensive URL analysis.

path

Extract URL path components after the hostname:

SELECT 
    domainRFC(url) AS hostname,
    path(url) AS url_path
FROM url_logs
WHERE url_path LIKE '/api/%';

The path() function returns everything between the hostname and query string, including the leading slash.

protocol

Get scheme information to understand connection types:

SELECT 
    protocol(url) AS scheme,
    domainRFC(url) AS hostname,
    count(*) AS requests
FROM url_logs
GROUP BY scheme, hostname
ORDER BY requests DESC;

This helps analyze traffic patterns across HTTP, HTTPS, FTP, and other protocol types in your data.

queryString

Parse query parameters for detailed request analysis:

SELECT 
    domainRFC(url) AS hostname,
    queryString(url) AS params,
    count(*) AS requests
FROM url_logs
WHERE params != ''
GROUP BY hostname, params;

Query strings contain the parameters passed after the ? character in URLs, which can be valuable for understanding user behavior patterns.

topLevelDomain

Extract TLD information for geographic or organizational analysis:

SELECT 
    topLevelDomain(domainRFC(url)) AS tld,
    count(*) AS requests
FROM url_logs
GROUP BY tld
ORDER BY requests DESC;

This function works on the output of domainRFC() to identify domains by their top-level domain suffix like .com, .org, or country codes.

Function	Purpose	Example Input	Example Output
`domainRFC()`	Extract hostname	`https://api.example.com:443/v1?key=value`	`api.example.com`
`path()`	Extract path	`https://api.example.com:443/v1?key=value`	`/v1`
`protocol()`	Extract scheme	`https://api.example.com:443/v1?key=value`	`https`
`queryString()`	Extract parameters	`https://api.example.com:443/v1?key=value`	`key=value`
`topLevelDomain()`	Extract TLD	`api.example.com`	`com`

Using Tinybird as a managed ClickHouse

Managing ClickHouse infrastructure for hostname extraction involves complexity—from optimizing storage and query performance to handling URL edge cases and maintaining consistent data quality across distributed systems.

Tinybird's managed ClickHouse service reduces infrastructure overhead by providing a managed ClickHouse environment that abstracts optimization, scaling, and maintenance. Tinybird exposes ClickHouse features such as domainRFC() and URL parsing without requiring direct infrastructure management, including database clusters, performance monitoring, or version upgrades.

This enables teams to focus on application development rather than database operations or configuring ClickHouse for production workloads. Tinybird's platform supports developer efficiency through managed scaling and workflow integrations.

Additional resources

Subscribe to SCHEMA > Evolution

We are Tinybird and we manage data for companies like Vercel and Canva. Plus, write a newsletter covering Data, AI and everything that matters in between. Join us.

FAQs about extracting RFC-compliant hostnames in ClickHouse

Which ClickHouse version introduced domainRFC?

The domainRFC() function was added in ClickHouse version 21.4 as an improvement over the original domain() function. Earlier versions only had the less reliable domain() function available.

How do I strip subdomains like www from the extracted hostname?

Use domainWithoutWWW(domainRFC(url)) to remove common www prefixes from the extracted hostname. This function specifically handles the www subdomain while preserving other subdomain structures.

Does domainRFC handle ftp or custom protocol schemes?

Yes, domainRFC() extracts hostnames from any valid URL regardless of the protocol scheme used. It works with HTTP, HTTPS, FTP, SSH, and custom protocols as long as the URL follows standard formatting.

Is domainRFC deterministic for materialized views?

Yes, domainRFC() produces consistent results for the same input, making it safe for materialized view calculations. The function's deterministic behavior ensures that materialized columns and projections update correctly during data ingestion.

Skip the infra work. Deploy your first ClickHouse project now

Product /

Company /

Resources /

Integrations /

Use Cases /

How to extract RFC-compliant hostnames from URLs in ClickHouse