Parametric aggregate functions¶
Some aggregate functions accept parameters in addition to their arguments. These parameters are constants used for initialization or configuration. The syntax for these functions uses two pairs of brackets: the first for parameters, and the second for arguments.
histogram¶
Calculates an approximate histogram for a set of numerical values. This function does not guarantee precise results, as it uses an adaptive algorithm.
Syntax¶
histogram(number_of_bins)(values)
Arguments¶
values: An expression representing the input numerical values for which to build the histogram.
Parameters¶
number_of_bins: UInt. The desired upper limit for the number of bins in the histogram. The function attempts to create this many bins but may use fewer if necessary.
Returns¶
An array of tuples, where each tuple represents a bin in the histogram. Each tuple contains three elements: (lower_bound, upper_bound, height). Type: Array(Tuple(Float64, Float64, Float64))
Example¶
SELECT histogram(5)(number + 1)
FROM (
SELECT number
FROM system.numbers
LIMIT 20
)
Result:
┌─histogram(5)(plus(number, 1))───────────────────────────────────────────┐ │ [(1,4.5,4),(4.5,8.5,4),(8.5,12.75,4.125),(12.75,17,4.625),(17,20,3.25)] │ └─────────────────────────────────────────────────────────────────────────┘
You can visualize a histogram with the bar function:
WITH histogram(5)(rand() % 100) AS hist
SELECT
arrayJoin(hist).3 AS height,
bar(height, 0, 6, 5) AS bar
FROM
(
SELECT number
FROM system.numbers
LIMIT 20
)
Result:
┌─height─┬─bar───┐ │ 2.125 │ █▋ │ │ 3.25 │ ██▌ │ │ 5.625 │ ████▏ │ │ 5.625 │ ████▏ │ │ 3.375 │ ██▌ │ └────────┴───────┘
sequenceMatch¶
Determines if a sequence of events matches a specified pattern. It returns 1 if a match is found, and 0 otherwise.
Syntax¶
sequenceMatch(pattern)(timestamp, cond1, cond2, ...)
Events that occur at the same second may be processed in an undefined order, which can affect the result.
Arguments¶
timestamp: Date, DateTime, or UInt. A column containing time data, used to order events.cond1,cond2, ...: UInt8. Boolean conditions that define the events in the sequence. Up to 32 conditions can be provided. Events not described by any condition are skipped.
Parameters¶
pattern: String. A string defining the event sequence pattern to match. See Pattern syntax below.
Returns¶
- 1 if the pattern is matched.
- 0 if the pattern is not matched. Type:
UInt8.
Pattern syntax¶
(?N): Matches the N-th condition argument (e.g.,(?1)matchescond1). Conditions are numbered from 1 to 32..*: Matches any number of events between specified conditions.(?t operator value): Specifies a time constraint between two events.operatorcan be>=,>,<,<=,==, andvalueis a time in seconds. For example,(?1)(?t>1800)(?2)matchescond1followed bycond2with more than 1800 seconds between them.
Example¶
Consider a dataset of events with time and number values.
SELECT sequenceMatch('(?1)(?2)')(time, number = 1, number = 2)
FROM (
SELECT 1 AS time, 1 AS number UNION ALL
SELECT 2 AS time, 3 AS number UNION ALL
SELECT 3 AS time, 2 AS number
) AS t
Result:
┌─sequenceMatch('(?1)(?2)')(time, equals(number, 1), equals(number, 2))─┐
│ 1 │
└───────────────────────────────────────────────────────────────────────┘
sequenceCount¶
Counts the number of non-overlapping event chains that match a specified pattern. The function searches for the next chain immediately after a current chain is matched.
Syntax¶
sequenceCount(pattern)(timestamp, cond1, cond2, ...)
Events that occur at the same second may be processed in an undefined order, which can affect the result.
Arguments¶
timestamp: Date, DateTime, or UInt. A column containing time data, used to order events.cond1,cond2, ...: UInt8. Boolean conditions that define the events in the sequence. Up to 32 conditions can be provided. Events not described by any condition are skipped.
Parameters¶
pattern: String. A string defining the event sequence pattern to match. See Pattern syntax forsequenceMatch.
Returns¶
The number of non-overlapping event chains that match the pattern. Type: UInt64.
Example¶
Count how many times number = 2 occurs after number = 1, allowing any number of other events in between.
SELECT sequenceCount('(?1).*(?2)')(time, number = 1, number = 2)
FROM (
SELECT 1 AS time, 1 AS number UNION ALL
SELECT 2 AS time, 3 AS number UNION ALL
SELECT 3 AS time, 2 AS number UNION ALL
SELECT 4 AS time, 1 AS number UNION ALL
SELECT 5 AS time, 3 AS number UNION ALL
SELECT 6 AS time, 2 AS number
) AS t
Result:
┌─sequenceCount('(?1).*(?2)')(time, equals(number, 1), equals(number, 2))─┐
│ 2 │
└─────────────────────────────────────────────────────────────────────────┘
See Also¶
sequenceMatchEvents¶
Returns an array of timestamps corresponding to the events in the longest matched event chain for a given pattern.
Syntax¶
sequenceMatchEvents(pattern)(timestamp, cond1, cond2, ...)
Events that occur at the same second may be processed in an undefined order, which can affect the result.
Arguments¶
timestamp: Date, DateTime, or UInt. A column containing time data, used to order events.cond1,cond2, ...: UInt8. Boolean conditions that define the events in the chain. Up to 32 conditions can be provided. Events not described by any condition are skipped.
Parameters¶
pattern: String. A string defining the event sequence pattern to match. See Pattern syntax forsequenceMatch.
Returns¶
An array of timestamps for the matched condition arguments (?N) from the longest event chain. The position in the array corresponds to the position of the condition argument in the pattern. Type: Array(TimestampType).
Example¶
Return timestamps of events for the longest chain matching (?1).*(?2).*(?1)(?3).
SELECT sequenceMatchEvents('(?1).*(?2).*(?1)(?3)')(time, number = 1, number = 2, number = 4)
FROM (
SELECT 1 AS time, 1 AS number UNION ALL
SELECT 2 AS time, 3 AS number UNION ALL
SELECT 3 AS time, 2 AS number UNION ALL
SELECT 4 AS time, 1 AS number UNION ALL
SELECT 5 AS time, 3 AS number UNION ALL
SELECT 6 AS time, 2 AS number
) AS t
Result:
┌─sequenceMatchEvents('(?1).*(?2).*(?1)(?3)')(time, equals(number, 1), equals(number, 2), equals(number, 4))─┐
│ [1,3,4] │
└────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
See Also¶
windowFunnel¶
Analyzes event chains within a sliding time window and calculates the maximum number of consecutive conditions met from the chain.
Syntax¶
windowFunnel(window, [mode, ...])(timestamp, cond1, cond2, ..., condN)
Arguments¶
timestamp: Date, DateTime, or UInt. The column containing event timestamps.cond: UInt8. Boolean conditions describing the sequence of events.
Parameters¶
window: UInt. The length of the sliding time window. This defines the maximum time interval (in the units of thetimestampcolumn) between the first and last matched condition for a chain to be considered valid.mode: String. Optional. One or more modes can be specified to modify the funnel behavior:'strict_deduplication': If the same condition holds for consecutive events, further processing is interrupted.'strict_order': Prevents other events from intervening between matched conditions.'strict_increase': Only applies conditions to events with strictly increasing timestamps.'strict_once': Counts each event only once within a chain, even if it meets multiple conditions.
Returns¶
The maximum number of consecutive conditions from the chain that were met within the sliding time window. Type: Integer.
Example¶
Determine the maximum funnel stage reached by users within a 7-day window (604800 seconds).
SCHEMA >
event_date Date `json:$.event_date`,
user_id UInt32 `json:$.user_id`,
timestamp DateTime `json:$.timestamp`,
eventID UInt32 `json:$.eventID`,
product String `json:$.product`
ENGINE "MergeTree"
tb push datasources/trend.datasource
echo '
{"event_date": "2019-01-28", "user_id": 1, "timestamp": "2019-01-29 10:00:00", "eventID": 1003, "product": "phone"}
{"event_date": "2019-01-31", "user_id": 1, "timestamp": "2019-01-31 09:00:00", "eventID": 1007, "product": "phone"}
{"event_date": "2019-01-30", "user_id": 1, "timestamp": "2019-01-30 08:00:00", "eventID": 1009, "product": "phone"}
{"event_date": "2019-02-01", "user_id": 1, "timestamp": "2019-02-01 08:00:00", "eventID": 1010, "product": "phone"}' > trend.ndjson
tb datasource append trend trend.ndjson
SELECT
level,
count() AS c
FROM
(
SELECT
user_id,
windowFunnel(604800)(timestamp, eventID = 1003, eventID = 1009, eventID = 1007, eventID = 1010) AS level
FROM trend
WHERE (event_date >= '2019-01-01') AND (event_date <= '2019-02-02')
GROUP BY user_id
)
GROUP BY level
ORDER BY level ASC
Result:
┌─level─┬─c─┐ │ 4 │ 1 │ └───────┴───┘
retention¶
Calculates an array of boolean values indicating whether a series of conditions were met for a given event, relative to the first condition.
Syntax¶
retention(cond1, cond2, ..., cond32)
Arguments¶
cond: UInt8. A boolean expression (0 or 1) representing a condition. Up to 32 conditions can be provided.
Returns¶
An array of UInt8 values. The first element indicates if cond1 was met. Subsequent elements N indicate if cond1 AND condN were met. Type: Array(UInt8).
Example¶
Calculate user retention across three specific dates.
SELECT
uid,
retention(date = '2020-01-01', date = '2020-01-02', date = '2020-01-03') AS r
FROM (
SELECT '2020-01-01' as date, number as uid FROM numbers(5) UNION ALL
SELECT '2020-01-02' as date, number as uid FROM numbers(10) UNION ALL
SELECT '2020-01-03' as date, number as uid FROM numbers(15))
WHERE date IN ('2020-01-01', '2020-01-02', '2020-01-03')
GROUP BY uid
ORDER BY uid ASC
Result:
┌─uid─┬─r───────┐ │ 0 │ [1,1,1] │ │ 1 │ [1,1,1] │ │ 2 │ [1,1,1] │ │ 3 │ [1,1,1] │ │ 4 │ [1,1,1] │ │ 5 │ [0,0,0] │ │ 6 │ [0,0,0] │ │ 7 │ [0,0,0] │ │ 8 │ [0,0,0] │ │ 9 │ [0,0,0] │ │ 10 │ [0,0,0] │ │ 11 │ [0,0,0] │ │ 12 │ [0,0,0] │ │ 13 │ [0,0,0] │ │ 14 │ [0,0,0] │ └─────┴─────────┘
uniqUpTo¶
Calculates the number of distinct values for an argument, up to a specified limit N. If the count exceeds N, it returns N + 1. Otherwise, it returns the exact count.
Syntax¶
uniqUpTo(N)(x)
Arguments¶
x: The expression for which to count unique values.
Parameters¶
N: UInt. The maximum limit for the unique count. Recommended for small values (up to 10), with a maximum of 100.
Returns¶
The number of unique values, or N + 1 if the count exceeds N. Type: UInt64.
Example¶
Count unique numbers up to a limit of 3.
SELECT uniqUpTo(3)(number)
FROM (
SELECT number FROM system.numbers LIMIT 5
)
Result:
┌─uniqUpTo(3)(number)─┐ │ 4 │ └─────────────────────┘
sumMapFiltered¶
Aggregates values from maps, similar to sumMap, but only for keys specified in a filter array. This is useful for high-cardinality maps where you only need to sum specific keys.
Syntax¶
sumMapFiltered(keys_to_keep)(keys, values)
Arguments¶
keys: Array. An array of keys from the map.values: Array. An array of values corresponding to the keys.
Parameters¶
keys_to_keep: Array. An array of keys to include in the summation. Only values associated with these keys will be summed.
Returns¶
A tuple containing two arrays: the filtered and sorted keys, and their corresponding summed values. Type: Tuple(Array(KeyType), Array(ValueType)).
Example¶
Sum values for specific keys 1, 4, and 8 from a map.
SELECT sumMapFiltered([1, 4, 8])(mapKeys(statusMap),mapValues(statusMap)) as summapfiltered
FROM (
SELECT
c1::Date as date,
c2::DateTime as timeslot,
(c3::Array(UInt16), c4::Array(UInt64))::Map(UInt16, UInt64) as statusMap
FROM VALUES(
('2000-01-01', '2000-01-01 00:00:00', [1, 2, 3], [10, 10, 10]),
('2000-01-01', '2000-01-01 00:00:00', [3, 4, 5], [10, 10, 10]),
('2000-01-01', '2000-01-01 00:01:00', [4, 5, 6], [10, 10, 10]),
('2000-01-01', '2000-01-01 00:01:00', [6, 7, 8], [10, 10, 10])
)
)
Result:
┌─────summapfiltered───┐ │ ([1,4,8],[10,20,10]) │ └──────────────────────┘
sumMapFilteredWithOverflow¶
Aggregates values from maps for specified keys, similar to sumMapFiltered, but performs summation with potential overflow. This means the return type for summed values will match the input value type, even if the sum exceeds its maximum capacity.
Syntax¶
sumMapFilteredWithOverflow(keys_to_keep)(keys, values)
Arguments¶
keys: Array. An array of keys from the map.values: Array. An array of values corresponding to the keys.
Parameters¶
keys_to_keep: Array. An array of keys to include in the summation. Only values associated with these keys will be summed.
Returns¶
A tuple containing two arrays: the filtered and sorted keys, and their corresponding summed values. The type of the summed values array will match the input values type. Type: Tuple(Array(KeyType), Array(ValueType)).
Example¶
Demonstrate sumMapFilteredWithOverflow where UInt8 values might overflow, compared to sumMapFiltered which promotes to UInt64 to prevent overflow.
SELECT
sumMapFilteredWithOverflow([1, 4, 8])(mapKeys(statusMap),mapValues(statusMap)) as summap_overflow,
toTypeName(summap_overflow)
FROM (
SELECT
c1::Date as date,
c2::DateTime as timeslot,
(c3::Array(UInt8), c4::Array(UInt8))::Map(UInt8, UInt8) as statusMap
FROM VALUES(
('2000-01-01', '2000-01-01 00:00:00', [1, 2, 3], [10, 10, 10]),
('2000-01-01', '2000-01-01 00:00:00', [3, 4, 5], [10, 10, 10]),
('2000-01-01', '2000-01-01 00:01:00', [4, 5, 6], [10, 10, 10]),
('2000-01-01', '2000-01-01 00:01:00', [6, 7, 8], [10, 10, 10])
)
)
Result:
┌─summap_overflow──────┬─toTypeName(summap_overflow)───────┐ │ ([1,4,8],[10,20,10]) │ Tuple(Array(UInt8), Array(UInt8)) │ └──────────────────────┴───────────────────────────────────┘
Compare with sumMapFiltered:
SELECT
sumMapFiltered([1, 4, 8])(mapKeys(statusMap),mapValues(statusMap)) as summap,
toTypeName(summap)
FROM (
SELECT
c1::Date as date,
c2::DateTime as timeslot,
(c3::Array(UInt8), c4::Array(UInt8))::Map(UInt8, UInt8) as statusMap
FROM VALUES(
('2000-01-01', '2000-01-01 00:00:00', [1, 2, 3], [10, 10, 10]),
('2000-01-01', '2000-01-01 00:00:00', [3, 4, 5], [10, 10, 10]),
('2000-01-01', '2000-01-01 00:01:00', [4, 5, 6], [10, 10, 10]),
('2000-01-01', '2000-01-01 00:01:00', [6, 7, 8], [10, 10, 10])
)
)
Result:
┌─summap───────────────┬─toTypeName(summap)─────────────────┐ │ ([1,4,8],[10,20,10]) │ Tuple(Array(UInt8), Array(UInt64)) │ └──────────────────────┴────────────────────────────────────┘
sequenceNextNode¶
Identifies and returns the value of the next event in a sequence that matches a specified pattern, relative to a base event.
Syntax¶
sequenceNextNode(direction, base)(timestamp, event_column, base_condition, event1, event2, event3, ...)
Arguments¶
timestamp: Date, DateTime, or UInt. The column containing event timestamps.event_column: String or Nullable(String). The column whose value should be returned for the next matched event.base_condition: UInt8. A boolean condition that the base event must fulfill.event1,event2, ...: UInt8. Boolean conditions describing the chain of events to match after the base event.
Parameters¶
direction: String. Specifies the search direction:'forward': Searches for events occurring after the base event.'backward': Searches for events occurring before the base event.
base: String. Defines how the base event is determined:'head': The first event in the sequence that satisfiesbase_condition.'tail': The last event in the sequence that satisfiesbase_condition.'first_match': The first event that matchesevent1and satisfiesbase_condition.'last_match': The last event that matchesevent1and satisfiesbase_condition.
Returns¶
The value from event_column of the next matched event, or NULL if no match is found or the next event does not exist. Type: Nullable(String).
Example¶
Find the event that follows 'A' then 'B' in a sequence, starting from the head of the sequence.
SELECT
id,
sequenceNextNode('forward', 'head')(dt, page, page = 'A', page = 'A', page = 'B') AS next_flow
FROM (
SELECT 1 AS dt, 1 AS id, 'A' AS page UNION ALL
SELECT 2 AS dt, 1 AS id, 'B' AS page UNION ALL
SELECT 3 AS dt, 1 AS id, 'C' AS page UNION ALL
SELECT 4 AS dt, 1 AS id, 'D' AS page UNION ALL
SELECT 5 AS dt, 1 AS id, 'E' AS page
) AS test_flow
GROUP BY id
Result:
┌─id─┬─next_flow─┐ │ 1 │ C │ └────┴───────────┘