When you're analyzing web traffic or building domain-level analytics, URLs arrive in frustratingly inconsistent formats: some with www prefixes, others without. That inconsistency makes it impossible to group www.example.com and example.com as the same domain. ClickHouse's domainWithoutWWW function solves this by extracting clean domain names from any URL format while automatically stripping the www prefix. This guide walks you through the function's syntax and edge cases, shows you how to process entire tables efficiently, and demonstrates how to build real-time domain analytics APIs using scalable infrastructure.
What domainWithoutWWW does and when to use it
The domainWithoutWWW function extracts the domain from a URL and automatically removes any leading "www." prefix if present. When you pass 'https://www.example.com/path' to this function, it returns 'example.com', stripping away the protocol, path, query parameters, and the www prefix in one operation.
This function becomes essential when you're working with web analytics data where URLs might inconsistently include or exclude the www prefix. Web traffic often comes from both www.example.com and example.com, but you want to treat them as the same domain for analytics purposes.
You'll find it particularly useful for grouping page views by domain, normalizing referrer data, or preparing URLs for domain-level aggregations. The function handles URL parsing complexity internally, so you don't have to write custom string manipulation logic to achieve consistent domain extraction across your dataset.
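As an illustration, normalizing referrer data becomes a single expression. This sketch assumes a hypothetical events table with a referrer String column:

```sql
-- Group inbound traffic by referring domain, so that
-- www.google.com and google.com count as the same source.
SELECT
    domainWithoutWWW(referrer) AS referrer_domain,
    count() AS hits
FROM events
GROUP BY referrer_domain
ORDER BY hits DESC;
```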
Syntax and parameters
The function follows a straightforward syntax pattern:
domainWithoutWWW(url)
The single parameter accepts a String type (or Nullable(String) if your data contains null values). You can pass full URLs with protocols, bare hostnames, or even URLs with complex query strings and fragments, and the function intelligently extracts just the domain portion.
Here's how it works with a simple example:
SELECT domainWithoutWWW('https://www.example.com/path?query=value') AS clean_domain;
-- Returns: 'example.com'
Return type and edge cases
The function returns a String (or Nullable(String) if your input column allows nulls). However, you'll encounter specific behaviors with malformed or unusual inputs that are worth understanding upfront.
When the function can't parse a valid domain from the input, it typically returns an empty string rather than throwing an error. This graceful handling means your queries won't break on messy data, though you might want to filter out empty results depending on your use case.
Empty Or Null Input
Empty strings and null values behave predictably:
SELECT domainWithoutWWW('') AS empty_result;
-- Returns: ''
SELECT domainWithoutWWW(NULL) AS null_result;
-- Returns: NULL
If you're working with a Nullable(String) column and want to avoid nulls in your results, you can use coalesce:
SELECT coalesce(domainWithoutWWW(url), '') AS clean_domain
FROM your_table;
Internationalized Domains
ClickHouse URL functions work with ASCII hostnames, so internationalized domains work best when already converted to punycode format:
SELECT domainWithoutWWW('https://www.xn--e1afmkfd.xn--p1ai') AS punycode_domain;
-- Returns: 'xn--e1afmkfd.xn--p1ai'
Unicode hostnames passed directly might produce inconsistent results depending on your ClickHouse build and locale settings.
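Recent ClickHouse releases (24.1 and later) include an idnaEncode function that converts Unicode hostnames to their ASCII (punycode) form. If your build has it, you can normalize Unicode input before extraction; a sketch:

```sql
-- idnaEncode (ClickHouse 24.1+) converts a Unicode hostname to punycode,
-- giving domainWithoutWWW the ASCII form it parses reliably.
SELECT domainWithoutWWW(concat('https://', idnaEncode('www.пример.рф'))) AS clean_domain;
-- Returns: 'xn--e1afmkfd.xn--p1ai'
```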
Missing Protocol
The function can extract domains from both full URLs and bare hostnames, though results vary with malformed strings:
SELECT domainWithoutWWW('www.example.com') AS bare_hostname;
-- Returns: 'example.com'
SELECT domainWithoutWWW('example.com/some/path') AS with_path;
-- Returns: 'example.com'
SELECT domainWithoutWWW('/just/a/path') AS path_only;
-- Returns: ''
Quick examples on single URLs
Let's walk through practical examples that demonstrate the function's behavior across different URL formats you'll encounter in real data.
Standard www URLs get cleaned as expected:
SELECT domainWithoutWWW('http://www.example.com') AS http_www;
-- Returns: 'example.com'
SELECT domainWithoutWWW('https://www.example.com:8443?q=1') AS https_with_port;
-- Returns: 'example.com'
URLs without www remain unchanged (except for protocol and path removal):
SELECT domainWithoutWWW('https://example.com') AS no_www;
-- Returns: 'example.com'
SELECT domainWithoutWWW('sub.example.com') AS subdomain;
-- Returns: 'sub.example.com'
Complex URLs with paths, queries, and fragments still extract cleanly:
SELECT domainWithoutWWW('https://www.blog.example.com/path/article?id=123#section') AS complex_url;
-- Returns: 'blog.example.com'
SELECT domainWithoutWWW('ftp://www.example.co.uk/files/download') AS ftp_url;
-- Returns: 'example.co.uk'
Notice how the function preserves subdomains like "blog" while removing only the "www" prefix specifically.
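If you want to collapse all subdomains down to the registrable domain rather than strip only www, ClickHouse's cutToFirstSignificantSubdomain function is the better fit:

```sql
-- Reduce any subdomain to the registrable domain.
SELECT cutToFirstSignificantSubdomain('https://blog.example.com/article') AS registrable;
-- Returns: 'example.com'
```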
Processing an entire ClickHouse table
Moving from single values to processing entire tables unlocks the real power of domain normalization at scale. Let's assume you have a table structure like this:
CREATE TABLE visits (
ts DateTime,
url String,
user_id UInt64
) ENGINE = MergeTree
ORDER BY ts;
Selecting Clean Domains
Extract and analyze cleaned domains across your dataset:
SELECT
domainWithoutWWW(url) AS clean_domain,
count() AS visits,
uniq(user_id) AS unique_users
FROM visits
WHERE ts >= now() - INTERVAL 7 DAY
GROUP BY clean_domain
ORDER BY visits DESC
LIMIT 10;
You can filter out malformed URLs that produce empty domains:
SELECT *
FROM visits
WHERE domainWithoutWWW(url) != ''
AND ts >= today() - 1;
Updating In Place
For tables where you frequently query by clean domain, consider adding a dedicated column:
ALTER TABLE visits ADD COLUMN clean_domain String DEFAULT '';
ALTER TABLE visits UPDATE clean_domain = domainWithoutWWW(url) WHERE 1;
Performance consideration: ALTER...UPDATE operations rewrite data and can be expensive on large tables. Schedule during low-traffic periods or consider the materialized column approach instead.
A materialized column automatically computes the clean domain for new data:
ALTER TABLE visits ADD COLUMN clean_domain String MATERIALIZED domainWithoutWWW(url);
Materialized View Pattern
For high-frequency analytics queries, you can precompute domain aggregations using materialized views:
CREATE TABLE domain_stats (
clean_domain String,
visits AggregateFunction(count),
unique_users AggregateFunction(uniq, UInt64)
) ENGINE = AggregatingMergeTree
ORDER BY clean_domain;
CREATE MATERIALIZED VIEW visits_to_domain_stats
TO domain_stats
AS SELECT
domainWithoutWWW(url) AS clean_domain,
countState() AS visits,
uniqState(user_id) AS unique_users
FROM visits
GROUP BY clean_domain;
This pattern precomputes domain-level metrics at insert time, dramatically speeding up dashboard queries that aggregate by domain.
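To read from the aggregating target table, finalize the stored states with the matching -Merge combinators:

```sql
-- Finalize the aggregate states written by the materialized view.
SELECT
    clean_domain,
    countMerge(visits) AS visits,
    uniqMerge(unique_users) AS unique_users
FROM domain_stats
GROUP BY clean_domain
ORDER BY visits DESC
LIMIT 10;
```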
domainWithoutWWW vs cutWWW vs domain
ClickHouse provides several related functions for URL and domain processing, each optimized for different use cases:
- domainWithoutWWW: extracts the domain from full URLs and removes the www prefix
- cutWWW: removes the www prefix from already-extracted domain strings
- domain: extracts the domain from URLs but preserves www
Here's how they differ in practice:
-- domainWithoutWWW: Full URL processing with www removal
SELECT domainWithoutWWW('https://www.example.com/path') AS result;
-- Returns: 'example.com'
-- cutWWW: String processing for already-extracted domains
SELECT cutWWW('www.example.com') AS result;
-- Returns: 'example.com'
-- domain: Full URL processing but keeps www
SELECT domain('https://www.example.com/path') AS result;
-- Returns: 'www.example.com'
Choose domainWithoutWWW when you're working with raw URL data and want normalized domain grouping. Use cutWWW for post-processing already-extracted domains, and domain when the www distinction matters for your analysis.
Performance considerations at scale
URL parsing involves string operations that can become expensive at scale, especially when processing billions of rows. Understanding performance characteristics helps you architect efficient data pipelines.
Function Costs In CPU Cycles
String parsing adds computational overhead proportional to your URL complexity and data volume. URLs with long paths, many query parameters, or unusual encoding patterns require more CPU cycles to process.
The function also allocates a new string for each result, which adds memory allocation overhead in high-throughput scenarios.
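One way to cut this allocation overhead is to store the precomputed domain as LowCardinality(String), which dictionary-encodes repeated values. A sketch, assuming the visits table from the earlier examples:

```sql
-- Dictionary-encode the cleaned domain: repeated domain strings are
-- stored once, reducing memory and I/O for domain-level GROUP BYs.
ALTER TABLE visits
    ADD COLUMN clean_domain LowCardinality(String)
    MATERIALIZED domainWithoutWWW(url);
```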
Using Projections Or Views
Precomputing cleaned domains often yields better query performance than on-the-fly parsing:
ALTER TABLE visits ADD PROJECTION domain_projection (
SELECT
domainWithoutWWW(url) AS clean_domain,
count()
GROUP BY clean_domain
);
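Note that ADD PROJECTION only affects parts written after the projection exists; to build it for data already in the table, materialize it explicitly:

```sql
-- Backfill the projection for parts inserted before it was added.
ALTER TABLE visits MATERIALIZE PROJECTION domain_projection;
```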
Materialized views can aggregate by clean domain during data ingestion, eliminating parsing overhead from analytical queries entirely.
Benchmark Results
In practice, on-the-fly parsing works well for exploratory queries and datasets under 100M rows. Beyond that scale, or for frequently accessed dashboard queries, precomputation through materialized columns or views typically provides a 2-5x performance improvement.
The exact performance gain depends on your URL complexity, query patterns, and hardware configuration, but the pattern holds consistently across different workloads.
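You can gauge the difference on your own data by timing the same aggregation against the raw url column and against a precomputed clean_domain column (assuming the materialized column added earlier):

```sql
-- On-the-fly: the domain is parsed from every url at query time.
SELECT domainWithoutWWW(url) AS d, count() FROM visits GROUP BY d;

-- Precomputed: reads the stored clean_domain column directly.
SELECT clean_domain AS d, count() FROM visits GROUP BY d;
```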
Building a real-time, ClickHouse-based web analytics API with Tinybird
Let's create a complete example that ingests web traffic data and exposes an API endpoint returning domain-level analytics using domainWithoutWWW.
This walkthrough shows how domain normalization integrates into real-world analytics workflows, from data ingestion through API deployment. You'll see exactly how the function works in practice.
Tinybird is a managed service that makes it easy to build real-time analytics APIs powered by ClickHouse. It handles the infrastructure, scaling, and API generation so you can focus on writing SQL and building analytics features.
In this example, we'll use Tinybird to:
- Create a data source for ingesting web traffic events
- Write a pipe that processes and aggregates the data using domainWithoutWWW
- Deploy an API endpoint that serves domain-level analytics
The workflow demonstrates how to combine ClickHouse's powerful URL processing with Tinybird's API capabilities to build production-ready analytics services.
1. Create the data source
First, create a data source to store your web traffic events:
SCHEMA >
`timestamp` DateTime64(3, 'UTC') `json:$.timestamp`,
`url` String `json:$.url`,
`user_id` UInt32 `json:$.user_id`,
`session_id` String `json:$.session_id`,
`page_title` String `json:$.page_title`,
`referrer` String `json:$.referrer`
ENGINE MergeTree
ENGINE_SORTING_KEY timestamp
ENGINE_PARTITION_KEY toYYYYMM(timestamp)
Now build the data source locally:
tb dev
This builds the data source on Tinybird Local (localhost:7181), which you'll need to have running. Then ingest some sample data using the Events API:
curl -X POST "http://localhost:7181/v0/events?name=web_traffic" \
-H "Content-Type: application/json" \
-d '{"timestamp": "2025-10-03 14:30:15.123", "url": "https://www.example.com/products", "user_id": 1001, "session_id": "sess_abc123", "page_title": "Products Page", "referrer": "https://www.google.com/search"}
{"timestamp": "2025-10-03 14:30:16.456", "url": "https://blog.example.com/article", "user_id": 1002, "session_id": "sess_def456", "page_title": "Blog Article", "referrer": "https://www.facebook.com"}
{"timestamp": "2025-10-03 14:30:17.789", "url": "https://www.github.com/tinybird", "user_id": 1003, "session_id": "sess_ghi789", "page_title": "GitHub Profile", "referrer": "https://www.twitter.com"}'
2. Write the pipe query
Create a pipe that aggregates traffic by normalized domain:
NODE traffic_events
SQL >
SELECT
timestamp,
domainWithoutWWW(url) AS clean_domain,
user_id,
session_id,
page_title,
domainWithoutWWW(referrer) AS clean_referrer
FROM web_traffic
WHERE domainWithoutWWW(url) != ''
AND timestamp >= now() - INTERVAL 7 DAY
NODE domain_analytics
SQL >
%
SELECT
clean_domain,
COUNT(*) AS total_visits,
COUNT(DISTINCT user_id) AS unique_users,
COUNT(DISTINCT session_id) AS unique_sessions,
COUNT(DISTINCT clean_referrer) AS unique_referrers,
MIN(timestamp) AS first_visit,
MAX(timestamp) AS last_visit
FROM traffic_events
WHERE
clean_domain = {{ String(domain) }}
AND timestamp >= {{ DateTime(date_from) }}
AND timestamp <= {{ DateTime(date_to) }}
GROUP BY clean_domain
ORDER BY total_visits DESC
LIMIT {{ Int32(limit, 100) }}
TYPE endpoint
The domainWithoutWWW function ensures consistent domain grouping regardless of www prefixes in your URLs.
3. Deploy and call the endpoint
Deploy your pipe to create the API endpoint:
tb --cloud deploy
Then call your new endpoint:
curl -H "Authorization: Bearer <your_token>" \
"https://api.tinybird.co/v0/pipes/domain_analytics.json?domain=example.com&limit=5"
The response returns neatly aggregated data with normalized domains:
{
...,
"data": [
{
"clean_domain": "example.com",
"total_visits": 1247,
"unique_users": 892,
"unique_sessions": 1156,
"unique_referrers": 23,
"first_visit": "2025-10-03 14:30:15",
"last_visit": "2025-10-03 15:45:22"
}
],
"rows": 1,
"statistics": {
"elapsed": 0.023,
"rows_read": 1000,
"bytes_read": 45000
}
}
Next steps with ClickHouse and Tinybird
ClickHouse is great for normalizing URL data to build clean domain-level analytics. For large datasets, ClickHouse lets you precompute clean domains through materialized columns or views to optimize query performance.
When you're ready to operationalize your ClickHouse-based analytics workflows into APIs, Tinybird enables you to deploy production APIs over your data transformations and manages the operational complexity of the underlying infrastructure.
Get started with a free Tinybird account to create real-time APIs over your domain analysis data.
FAQs about extracting domains in ClickHouse
Does domainWithoutWWW handle multiple subdomains like blog.www.example.com?
The function strips only a leading "www." prefix, leaving everything else intact. Because the www in blog.www.example.com is not at the start of the hostname, domainWithoutWWW('https://blog.www.example.com/article') returns 'blog.www.example.com' unchanged. By contrast, 'https://www.blog.example.com' becomes 'blog.example.com'.
Is the domainWithoutWWW function available in older ClickHouse versions?
Yes, domainWithoutWWW has been part of ClickHouse's core URL functions for many years and works across most maintained versions. You won't encounter compatibility issues unless you're running a very outdated ClickHouse installation.
Can I remove both the protocol and www together in one operation?
Absolutely. The domainWithoutWWW function handles both automatically, extracting the hostname from any valid URL format and removing the www prefix in a single operation. domainWithoutWWW('https://www.example.com/path?query=1') gives you 'example.com' directly.
Additional resources
- domainWithoutWWW function
- cutWWW function
- domain function
- URL functions in ClickHouse
- String functions in ClickHouse
- Working with URLs in Tinybird
- How to extract URL fragments without hash symbols (#) in ClickHouse
- How to extract the protocol of a URL in ClickHouse
- How to extract query parameters from URLs in ClickHouse