When you're processing millions of URLs in ClickHouse, the difference between getting example.com
and an empty string can break your entire analytics pipeline. The domainRFC()
function extracts RFC 3986-compliant hostnames from URLs, handling edge cases that the simpler domain()
function can't parse correctly.
This guide covers the syntax differences between ClickHouse's hostname extraction functions, performance optimization strategies for large datasets, and how to build production-ready APIs that process URL data reliably.
What makes a hostname RFC-compliant in ClickHouse
In ClickHouse, the domainRFC
function extracts hostnames from URLs following the RFC 3986 standard. This function handles complex URL structures that the simpler domain
function can't parse correctly.
RFC 3986 defines how URLs work (everything from the protocol (https://
) to the hostname (example.com
) to the path (/api/data
)). When you're dealing with URLs that contain user credentials, ports, or unusual formatting, RFC compliance becomes important for getting consistent results.
Here's what I mean: if you have a URL like https://user:password@api.example.com:8080/data
, the RFC-compliant approach correctly identifies api.example.com
as the hostname while ignoring the authentication details and port number.
The domainRFC function versus domain
ClickHouse gives you two options for hostname extraction, but they behave very differently. The domain()
function prioritizes speed over accuracy, while domainRFC()
follows web standards even when URLs get messy.
Let me show you exactly what happens with a tricky URL:
SELECT
domain('http://user:password@example.com:8080/path') AS domain_result,
domainRFC('http://user:password@example.com:8080/path') AS rfc_result;
domain_result | rfc_result |
---|---|
(empty string) | example.com |
The domain()
function gives up when it sees user credentials in the URL. Meanwhile, domainRFC()
correctly extracts the hostname you actually want. This difference matters when you're processing real-world data where URLs come from browsers, APIs, and log files that don't always follow perfect formatting.
Syntax and minimal query to extract hostnames
The domainRFC()
function takes one parameter—the URL string—and returns the hostname as a string.
1. Run a one-line select
SELECT domainRFC('https://clickhouse.com/docs/sql-reference/') AS hostname;
This returns clickhouse.com
. The function automatically strips away the protocol, path, query parameters, and everything else that isn't the hostname.
2. Cast result and check nullability
The function returns a String
type, but you'll get an empty string when the URL is malformed. You can check for valid results like this:
SELECT
url,
domainRFC(url) AS hostname,
hostname != '' AS is_valid
FROM my_table;
When ClickHouse can't find a valid hostname in your URL, domainRFC()
returns an empty string rather than throwing an error that would break your query.
Edge cases you handle in production
Real URL data gets messy fast. Web logs contain everything from IPv6 addresses to internationalized domains to completely broken URLs that somehow still need processing.
IPv6 literals
URLs can contain IPv6 addresses wrapped in square brackets:
SELECT domainRFC('http://[2001:db8::1]:8080/path') AS ipv6_host;
The function returns [2001:db8::1]
, keeping the bracket notation that distinguishes IPv6 addresses from regular hostnames.
Internationalized domain names
Unicode domain names get processed according to RFC standards:
SELECT domainRFC('https://例え.テスト/path') AS unicode_domain;
The function handles internationalized domain names (IDNs) by returning the Unicode representation as it appears in the original URL.
URLs with ports
Port numbers get stripped automatically:
SELECT domainRFC('https://api.example.com:443/v1/data') AS hostname_no_port;
This returns api.example.com
, removing the :443
port specification that's redundant for HTTPS URLs anyway.
- Missing schemes: URLs without
http://
orhttps://
may return empty strings - Invalid characters: Special characters get handled according to RFC encoding rules
- Empty inputs: Null or empty strings return empty results without breaking your query
Performance tactics for large tables
When you're processing millions of URLs, hostname extraction can slow down your queries significantly. However, a few optimization strategies can make a significant difference in performance.
1. Create a materialized column
Pre-compute hostnames when data arrives instead of parsing URLs repeatedly:
ALTER TABLE url_logs
ADD COLUMN hostname String MATERIALIZED domainRFC(url);
This approach calculates the hostname once during insertion and stores it permanently. Queries against the hostname
column run much faster than calling domainRFC()
every time you run analytics.
2. Use projections for rollups
Create projections that pre-aggregate data by hostname:
ALTER TABLE url_logs
ADD PROJECTION hostname_stats (
SELECT hostname, count(), sum(bytes)
GROUP BY hostname
);
Projections work like materialized views that ClickHouse maintains automatically. When you query traffic patterns by domain, ClickHouse uses the pre-computed aggregations instead of scanning the entire table.
3. Apply low-cardinality dictionaries
When your data contains many repeated hostnames, use LowCardinality
encoding:
CREATE TABLE url_logs (
timestamp DateTime,
url String,
hostname LowCardinality(String) MATERIALIZED domainRFC(url)
) ENGINE = MergeTree()
ORDER BY timestamp;
This optimization works best when you have thousands of URLs but only hundreds of unique hostnames—a common pattern in web analytics where users visit many pages on the same domains.
4. Avoid repeated parsing in joins
Extract hostnames once and reuse the result throughout complex queries:
WITH extracted AS (
SELECT url, domainRFC(url) AS hostname
FROM url_logs
WHERE timestamp >= yesterday()
)
SELECT
e.hostname,
count(*) AS requests,
d.category
FROM extracted e
JOIN domain_categories d ON e.hostname = d.hostname
GROUP BY e.hostname, d.category;
This pattern prevents calling domainRFC()
multiple times for the same URLs within a single query, which can add up to significant time savings on large datasets.
Building ClickHouse-based web and URL analytics APIs with Tinybird
Let's create a complete example that demonstrates how to build a real-time API for hostname extraction using Tinybird's managed ClickHouse platform.
This walkthrough shows how to ingest web analytics data, process URLs with ClickHouse functions, and expose hostname analytics through a real-time API endpoint.
1. Create the data source
First, create a data source to store web analytics events:
SCHEMA >
`timestamp` DateTime64(3) `json:$.timestamp`,
`url` String `json:$.url`,
`user_agent` String `json:$.user_agent`,
`status_code` UInt16 `json:$.status_code`,
`user_id` String `json:$.user_id`
ENGINE MergeTree
ENGINE_SORTING_KEY timestamp
ENGINE_PARTITION_KEY toYYYYMM(timestamp)
Now build the project locally:
tb dev
This builds the project on Tinybird Local and creates your data source. Then ingest some sample data using the Events API:
curl -X POST "http://localhost:7181/v0/events?name=web_analytics" \
-H "Content-Type: application/json" \
-d '{"timestamp": "2025-10-03 14:30:15.123", "url": "https://api.example.com:8080/v1/data?param=value", "user_agent": "Mozilla/5.0", "status_code": 200, "user_id": "user123"}
{"timestamp": "2025-10-03 14:30:16.456", "url": "https://www.shop.example.com/products?category=electronics", "user_agent": "Chrome/91.0", "status_code": 200, "user_id": "user456"}
{"timestamp": "2025-10-03 14:30:17.789", "url": "https://user:pass@secure.example.com:8443/admin", "user_agent": "Firefox/89.0", "status_code": 401, "user_id": "user789"}'
2. Create the pipe files
Create a pipe that processes URL data and provides analytics:
NODE web_events
SQL >
SELECT
timestamp,
url,
user_agent,
status_code,
user_id,
domainRFC(url) AS hostname,
path(url) AS url_path,
protocol(url) AS scheme,
queryString(url) AS query_params
FROM web_analytics
WHERE timestamp >= now64(3) - INTERVAL 1 HOUR
NODE hostname_analytics
SQL >
%
SELECT
hostname,
scheme,
COUNT(*) AS request_count,
COUNT(DISTINCT user_id) AS unique_users,
COUNTIf(status_code >= 400) AS error_count,
MIN(timestamp) AS first_seen,
MAX(timestamp) AS last_seen
FROM web_events
WHERE
{\% if defined(hostname_filter) %}
hostname LIKE {{ String(hostname_filter) }}
{\% end %}
{\% if defined(hours_back) %}
AND timestamp >= now64(3) - INTERVAL {{ Int32(hours_back) }} HOUR
{\% end %}
GROUP BY hostname, scheme
ORDER BY request_count DESC
LIMIT {{ Int32(limit, 100) }}
TYPE endpoint
3. Deploy and test the API
Deploy your project to Tinybird Cloud:
tb --cloud deploy
Then test your API endpoint:
# Get all hostname analytics from the last hour
curl -H "Authorization: Bearer <your_token>" \
"https://api.tinybird.co/v0/pipes/hostname_analytics.json"
# Filter by hostname pattern
curl -H "Authorization: Bearer <your_token>" \
"https://api.tinybird.co/v0/pipes/hostname_analytics.json?hostname_filter=%example.com"
# Filter by time range
curl -H "Authorization: Bearer <your_token>" \
"https://api.tinybird.co/v0/pipes/hostname_analytics.json?hours_back=2"
4. Sample API response
The API returns structured data with hostname analytics:
{
"meta": [
{
"name": "hostname",
"type": "String"
},
{
"name": "scheme",
"type": "String"
},
{
"name": "request_count",
"type": "UInt64"
},
{
"name": "unique_users",
"type": "UInt64"
},
{
"name": "error_count",
"type": "UInt64"
},
{
"name": "first_seen",
"type": "DateTime64(3)"
},
{
"name": "last_seen",
"type": "DateTime64(3)"
}
],
"data": [
{
"hostname": "api.example.com",
"scheme": "https",
"request_count": 1247,
"unique_users": 892,
"error_count": 23,
"first_seen": "2025-10-03 14:30:15.123",
"last_seen": "2025-10-03 15:45:22.789"
},
{
"hostname": "shop.example.com",
"scheme": "https",
"request_count": 456,
"unique_users": 234,
"error_count": 5,
"first_seen": "2025-10-03 14:30:16.456",
"last_seen": "2025-10-03 15:30:18.345"
}
],
"rows": 2,
"statistics": {
"elapsed": 0.045,
"rows_read": 2000,
"bytes_read": 89000
}
}
Common mistakes and how to debug them
Hostname extraction looks simple until you encounter real-world data complexity. Understanding common pitfalls helps you build more robust URL processing pipelines.
Choosing domain instead of domainRFC
The domain()
function fails silently with many URL formats, returning empty strings where you'd expect valid hostnames:
SELECT
url,
domain(url) AS domain_result,
domainRFC(url) AS rfc_result
FROM (
SELECT 'ftp://user:pass@files.example.com/data' AS url
UNION ALL
SELECT 'https://api.example.com:8443/v1'
)
WHERE domain_result != rfc_result;
Always use domainRFC()
unless you've specifically tested domain()
against your complete dataset and confirmed it handles all your URL patterns correctly.
Ignoring null results on malformed URLs
Malformed URLs produce empty strings that can break downstream processing:
SELECT
url,
domainRFC(url) AS hostname,
CASE
WHEN hostname = '' THEN 'invalid_url'
ELSE hostname
END AS processed_hostname
FROM url_table
WHERE url IS NOT NULL;
This pattern handles invalid URLs explicitly rather than letting empty strings propagate through your analysis and create confusing results.
Parsing inside a subquery without limit
Nested hostname extraction without proper limits can cause performance issues:
-- Avoid this pattern
SELECT hostname, count(*)
FROM (
SELECT domainRFC(url) AS hostname
FROM large_url_table -- millions of rows
WHERE hostname IN (
SELECT domainRFC(competitor_url)
FROM competitor_urls -- parsed repeatedly
)
)
GROUP BY hostname;
Instead, extract hostnames once and join on the results, or use materialized columns to avoid repeated parsing overhead.
Related URL parsing functions to know
ClickHouse provides a complete suite of URL manipulation functions that work alongside domainRFC()
for comprehensive URL analysis.
path
Extract URL path components after the hostname:
SELECT
domainRFC(url) AS hostname,
path(url) AS url_path
FROM url_logs
WHERE url_path LIKE '/api/%';
The path()
function returns everything between the hostname and query string, including the leading slash.
protocol
Get scheme information to understand connection types:
SELECT
protocol(url) AS scheme,
domainRFC(url) AS hostname,
count(*) AS requests
FROM url_logs
GROUP BY scheme, hostname
ORDER BY requests DESC;
This helps analyze traffic patterns across HTTP, HTTPS, FTP, and other protocol types in your data.
queryString
Parse query parameters for detailed request analysis:
SELECT
domainRFC(url) AS hostname,
queryString(url) AS params,
count(*) AS requests
FROM url_logs
WHERE params != ''
GROUP BY hostname, params;
Query strings contain the parameters passed after the ?
character in URLs, which can be valuable for understanding user behavior patterns.
topLevelDomain
Extract TLD information for geographic or organizational analysis:
SELECT
topLevelDomain(domainRFC(url)) AS tld,
count(*) AS requests
FROM url_logs
GROUP BY tld
ORDER BY requests DESC;
This function works on the output of domainRFC()
to identify domains by their top-level domain suffix like .com
, .org
, or country codes.
Function | Purpose | Example Input | Example Output |
---|---|---|---|
domainRFC() | Extract hostname | https://api.example.com:443/v1?key=value | api.example.com |
path() | Extract path | https://api.example.com:443/v1?key=value | /v1 |
protocol() | Extract scheme | https://api.example.com:443/v1?key=value | https |
queryString() | Extract parameters | https://api.example.com:443/v1?key=value | key=value |
topLevelDomain() | Extract TLD | api.example.com | com |
Using Tinybird as a managed ClickHouse
Managing ClickHouse infrastructure for hostname extraction involves complexity—from optimizing storage and query performance to handling URL edge cases and maintaining consistent data quality across distributed systems.
Tinybird's managed ClickHouse service reduces infrastructure overhead by providing a managed ClickHouse environment that abstracts optimization, scaling, and maintenance. Tinybird exposes ClickHouse features such as domainRFC()
and URL parsing without requiring direct infrastructure management, including database clusters, performance monitoring, or version upgrades.
This enables teams to focus on application development rather than database operations or configuring ClickHouse for production workloads. Tinybird's platform supports developer efficiency through managed scaling and workflow integrations.
Sign up for a free Tinybird account to build and test your first ClickHouse-based API in just a few minutes.
Additional resources
- domainRFC function
- domain function
- path function
- protocol function
- queryString function
- topLevelDomain function
- URL functions overview
- How to extract domain without www in ClickHouse
- Extract URL query string in ClickHouse
- How to build URL hierarchy in ClickHouse
FAQs about extracting RFC-compliant hostnames in ClickHouse
Which ClickHouse version introduced domainRFC?
The domainRFC()
function was added in ClickHouse version 21.4 as an improvement over the original domain()
function. Earlier versions only had the less reliable domain()
function available.
How do I strip subdomains like www from the extracted hostname?
Use domainWithoutWWW(domainRFC(url))
to remove common www prefixes from the extracted hostname. This function specifically handles the www subdomain while preserving other subdomain structures.
Does domainRFC handle ftp or custom protocol schemes?
Yes, domainRFC()
extracts hostnames from any valid URL regardless of the protocol scheme used. It works with HTTP, HTTPS, FTP, SSH, and custom protocols as long as the URL follows standard formatting.
Is domainRFC deterministic for materialized views?
Yes, domainRFC()
produces consistent results for the same input, making it safe for materialized view calculations. The function's deterministic behavior ensures that materialized columns and projections update correctly during data ingestion.