Aggregate functions

Aggregate functions perform a calculation on a set of values and return a single value. They are commonly used in data analysis and reporting to summarize large datasets. These functions can be used to compute sums, averages, counts, and other statistical measures.

During aggregation, NULL arguments are generally skipped. If an aggregate function has multiple arguments, any row where one or more of these arguments are NULL will be ignored. Some functions, like first_value or last_value with the RESPECT NULLS modifier, are exceptions to this rule.

Aggregate function combinators

The name of an aggregate function can have a suffix appended to it. This changes the way the aggregate function works.

-If

The suffix -If can be appended to the name of any aggregate function. In this case, the aggregate function accepts an extra argument – a condition (Uint8 type). The aggregate function processes only the rows that trigger the condition. If the condition was not triggered even once, it returns a default value (usually zeros or empty strings).

Examples: sumIf(column, cond), countIf(cond), avgIf(x, cond), quantilesTimingIf(level1, level2)(x, cond), argMinIf(arg, val, cond) and so on.

With conditional aggregate functions, you can calculate aggregates for several conditions at once, without using subqueries and JOINs. For example, conditional aggregate functions can be used to implement the segment comparison functionality.

-Array

The -Array suffix can be appended to any aggregate function. In this case, the aggregate function takes arguments of the 'Array(T)' type (arrays) instead of 'T' type arguments. If the aggregate function accepts multiple arguments, this must be arrays of equal lengths. When processing arrays, the aggregate function works like the original aggregate function across all array elements.

Example 1: sumArray(arr) - Totals all the elements of all 'arr' arrays. In this example, it could have been written more simply: sum(arraySum(arr)).

Example 2: uniqArray(arr) – Counts the number of unique elements in all 'arr' arrays. This could be done an easier way: uniq(arrayJoin(arr)), but it's not always possible to add 'arrayJoin' to a query.

-If and -Array can be combined. However, 'Array' must come first, then 'If'. Examples: uniqArrayIf(arr, cond), quantilesTimingArrayIf(level1, level2)(arr, cond). Due to this order, the 'cond' argument won't be an array.

-Map

The -Map suffix can be appended to any aggregate function. This will create an aggregate function which gets Map type as an argument, and aggregates values of each key of the map separately using the specified aggregate function. The result is also of a Map type.

Example

WITH map_map AS (
  SELECT c1::Date AS date, c2::DateTime AS timeslot, c3::Map(String, UInt16) AS status FROM values (
    ('2000-01-01', '2000-01-01 00:00:00', (['a', 'b', 'c'], [10, 10, 10])),
    ('2000-01-01', '2000-01-01 00:00:00', (['c', 'd', 'e'], [10, 10, 10])),
    ('2000-01-01', '2000-01-01 00:01:00', (['d', 'e', 'f'], [10, 10, 10])),
    ('2000-01-01', '2000-01-01 00:01:00', (['f', 'g', 'g'], [10, 10, 10]))
))
SELECT
    timeslot,
    sumMap(status),
    avgMap(status),
    minMap(status)
FROM map_map
GROUP BY timeslot

Result:

┌────────────timeslot─┬─sumMap(status)───────────────────────┬─avgMap(status)───────────────────────┬─minMap(status)───────────────────────┐
│ 2000-01-01 00:00:00 │ {'a':10,'b':10,'c':20,'d':10,'e':10} │ {'a':10,'b':10,'c':10,'d':10,'e':10} │ {'a':10,'b':10,'c':10,'d':10,'e':10} │
│ 2000-01-01 00:01:00 │ {'d':10,'e':10,'f':20,'g':20}        │ {'d':10,'e':10,'f':10,'g':10}        │ {'d':10,'e':10,'f':10,'g':10}        │
└─────────────────────┴──────────────────────────────────────┴──────────────────────────────────────┴──────────────────────────────────────┘

-SimpleState

If you apply this combinator, the aggregate function returns the same value but with a different type. This is a SimpleAggregateFunction(...) that can be stored in a table to work with AggregatingMergeTree tables.

Syntax

<aggFunction>SimpleState(x)

Arguments

  • x: Aggregate function parameters.

Returned values

The value of an aggregate function with the SimpleAggregateFunction(...) type.

Example

Query:

WITH anySimpleState(number) AS c SELECT toTypeName(c), c FROM numbers(1)

Result:

┌─toTypeName(c)────────────────────────┬─c─┐
│ SimpleAggregateFunction(any, UInt64) │ 0 │
└──────────────────────────────────────┴───┘

-State

If you apply this combinator, the aggregate function doesn't return the resulting value (such as the number of unique values for the uniq function), but an intermediate state of the aggregation (for uniq, this is the hash table for calculating the number of unique values). This is an AggregateFunction(...) that can be used for further processing or stored in a table to finish aggregating later.

-MapState isn't an invariant for the same data because the order of data in intermediate state can change, though it doesn't impact ingestion of this data.

To work with these states, use:

  • AggregatingMergeTree table engine.
  • finalizeAggregation function.
  • runningAccumulate function.
  • -Merge combinator.
  • -MergeState combinator.

-Merge

If you apply this combinator, the aggregate function takes the intermediate aggregation state as an argument, combines the states to finish aggregation, and returns the resulting value.

-MergeState

Merges the intermediate aggregation states in the same way as the -Merge combinator. However, it doesn't return the resulting value, but an intermediate aggregation state, similar to the -State combinator.

-ForEach

Converts an aggregate function for tables into an aggregate function for arrays that aggregates the corresponding array items and returns an array of results. For example, sumForEach for the arrays [1, 2], [3, 4, 5]and[6, 7]returns the result [10, 13, 5] after adding together the corresponding array items.

-Distinct

Every unique combination of arguments will be aggregated only once. Repeating values are ignored. Examples: sum(DISTINCT x) (or sumDistinct(x)), groupArray(DISTINCT x) (or groupArrayDistinct(x)), corrStable(DISTINCT x, y) (or corrStableDistinct(x, y)) and so on.

-OrDefault

Changes behavior of an aggregate function.

If an aggregate function doesn't have input values, with this combinator it returns the default value for its return data type. Applies to the aggregate functions that can take empty input data.

-OrDefault can be used with other combinators.

Syntax

<aggFunction>OrDefault(x)

Arguments

  • x: Aggregate function parameters.

Returned values

Returns the default value of an aggregate function's return type if there is nothing to aggregate.

Type depends on the aggregate function used.

Example

Query:

SELECT avg(number), avgOrDefault(number) FROM numbers(0)

Result:

┌─avg(number)─-─avgOrDefault(number)─┐
│         nan │                    0 │
└─────────────┴──────────────────────┘

Also -OrDefault can be used with another combinators. It is useful when the aggregate function doesn't accept the empty input.

Query:

SELECT avgOrDefaultIf(x, x > 10)
FROM
(
    SELECT toDecimal32(1.23, 2) AS x
)

Result:

┌─avgOrDefaultIf(x, greater(x, 10))─┐
│                              0.00 │
└───────────────────────────────────┘

-OrNull

Changes behavior of an aggregate function.

This combinator converts a result of an aggregate function to the Nullable data type. If the aggregate function doesn't have values to calculate it returns NULL.

-OrNull can be used with other combinators.

Syntax

<aggFunction>OrNull(x)

Arguments

  • x: Aggregate function parameters.

Returned values

  • The result of the aggregate function, converted to the Nullable data type.
  • NULL, if there is nothing to aggregate.

Type: Nullable(aggregate function return type).

Example

Add -orNull to the end of aggregate function.

Query:

SELECT sumOrNull(number), toTypeName(sumOrNull(number)) FROM numbers(10) WHERE number > 10

Result:

┌─sumOrNull(number)─┬─toTypeName(sumOrNull(number))─┐
│              ᴺᵁᴸᴸ │ Nullable(UInt64)              │
└───────────────────┴───────────────────────────────┘

Also -OrNull can be used with another combinators. It is useful when the aggregate function doesn't accept the empty input.

Query:

SELECT avgOrNullIf(x, x > 10)
FROM
(
    SELECT toDecimal32(1.23, 2) AS x
)

Result:

┌─avgOrNullIf(x, greater(x, 10))─┐
│                           ᴺᵁᴸᴸ │
└────────────────────────────────┘

-Resample

Lets you divide data into groups, and then separately aggregates the data in those groups. Groups are created by splitting the values from one column into intervals.

<aggFunction>Resample(start, end, step)(<aggFunction_params>, resampling_key)

Arguments

  • start: Starting value of the whole required interval for resampling_key values.
  • stop: Ending value of the whole required interval for resampling_key values. The whole interval doesn't include the stop value [start, stop).
  • step: Step for separating the whole interval into subintervals. The aggFunction is executed over each of those subintervals independently.
  • resampling_key: Column whose values are used for separating data into intervals.
  • aggFunction_params: aggFunction parameters.

Returned values

  • Array of aggFunction results for each subinterval.

Example

Consider the people table with the following data:

┌─name───┬─age─┬─wage─┐
│ John   │  16 │   10 │
│ Alice  │  30 │   15 │
│ Mary   │  35 │    8 │
│ Evelyn │  48 │ 11.5 │
│ David  │  62 │  9.9 │
│ Brian  │  60 │   16 │
└────────┴─────┴──────┘

Let's get the names of the people whose age lies in the intervals of [30,60) and [60,75). Since we use integer representation for age, we get ages in the [30, 59] and [60,74] intervals.

To aggregate names in an array, we use the groupArray aggregate function. It takes one argument. In our case, it's the name column. The groupArrayResample function should use the age column to aggregate names by age. To define the required intervals, we pass the 30, 75, 30 arguments into the groupArrayResample function.

SELECT groupArrayResample(30, 75, 30)(name, age) FROM people

Result:

┌─groupArrayResample(30, 75, 30)(name, age)─────┐
│ [['Alice','Mary','Evelyn'],['David','Brian']] │
└───────────────────────────────────────────────┘

Consider the results.

John is out of the sample because he's too young. Other people are distributed according to the specified age intervals.

Now let's count the total number of people and their average wage in the specified age intervals.

SELECT
    countResample(30, 75, 30)(name, age) AS amount,
    avgResample(30, 75, 30)(wage, age) AS avg_wage
FROM people

Result:

┌─amount─┬─avg_wage──────────────────┐
│ [3,2]  │ [11.5,12.949999809265137] │
└────────┴───────────────────────────┘

-ArgMin

The suffix -ArgMin can be appended to the name of any aggregate function. In this case, the aggregate function accepts an additional argument, which should be any comparable expression. The aggregate function processes only the rows that have the minimum value for the specified extra expression.

Examples: sumArgMin(column, expr), countArgMin(expr), avgArgMin(x, expr) and so on.

-ArgMax

Similar to suffix -ArgMin but processes only the rows that have the maximum value for the specified extra expression.

analysisOfVariance

Performs a one-way analysis of variance (ANOVA) test to determine if the means of two or more groups are significantly different.

Syntax

analysisOfVariance(val, group_no)

Alias: anova

Arguments

  • val: Numeric value for the observation.
  • group_no: Integer representing the group the val belongs to. Groups are enumerated starting from 0.

Returns

A tuple containing the F-statistic and the p-value of the ANOVA test. Tuple(Float64, Float64).

Example

SELECT analysisOfVariance(number, number % 2) FROM numbers(100)

Result:

┌─analysisOfVariance(number, modulo(number, 2))─┐
│ (0,1)                                         │
└───────────────────────────────────────────────┘

any

Selects the first encountered non-NULL value from a column within a group. The order of selection is non-deterministic unless the data is explicitly ordered.

Syntax

any(column) [RESPECT NULLS]

Aliases: any_value, first_value. For any(column) RESPECT NULLS: any_respect_nulls, first_value_respect_nulls, any_value_respect_nulls.

Arguments

  • column: The column from which to select a value.

Returns

The first value encountered. If no non-NULL values are found, it returns the default value for the column's type (e.g., 0 for integers, empty string for strings). If RESPECT NULLS is used, it returns the first value encountered, even if it's NULL. The return type matches the input column's type.

Example

WITH cte AS (SELECT arrayJoin([NULL, 'Amsterdam', 'New York', 'Tokyo', 'Valencia', NULL]) as city)
SELECT any(city), any_respect_nulls(city) FROM cte

Result:

┌─any(city)─┬─any_respect_nulls(city)─┐
│ Amsterdam │ ᴺᵁᴸᴸ                    │
└───────────┴─────────────────────────┘

anyHeavy

Identifies a frequently occurring value in a column using the heavy hitters algorithm. This function is non-deterministic and aims to find a value that appears in more than half of the cases within each execution thread.

Syntax

anyHeavy(column)

Arguments

  • column: The column from which to find a frequently occurring value.

Returns

A frequently occurring value from the column. The return type matches the input column's type.

Example

WITH cte AS (SELECT arrayJoin([2,1,1,1,3,1,1,2,2]) as n)
SELECT any(n), anyHeavy(n) FROM cte

Result:

┌─any(n)─┬─anyHeavy(n)─┐
│      2 │           1 │
└────────┴─────────────┘

anyLast

Selects the last encountered non-NULL value from a column within a group. The order of selection is non-deterministic unless the data is explicitly ordered.

Syntax

anyLast(column) [RESPECT NULLS]

Alias: last_value. For anyLast(column) RESPECT NULLS: anyLast_respect_nulls, last_value_respect_nulls.

Arguments

  • column: The column from which to select a value.

Returns

The last value encountered. If no non-NULL values are found, it returns the default value for the column's type. If RESPECT NULLS is used, it returns the last value encountered, even if it's NULL. The return type matches the input column's type.

Example

WITH cte AS (SELECT arrayJoin([NULL, 'Amsterdam', 'New York', 'Tokyo', 'Valencia', NULL]) as city)
SELECT anyLast(city), anyLast_respect_nulls(city) FROM cte

Result:

┌─anyLast(city)─┬─anyLast_respect_nulls(city)─┐
│ Valencia      │ ᴺᵁᴸᴸ                        │
└───────────────┴─────────────────────────────┘

approx_top_k

Returns an array of the approximately most frequent values and their counts from a column. The result is sorted by approximate frequency in descending order.

Syntax

approx_top_k(N)(column)
approx_top_k(N, reserved)(column)

Alias: approx_top_count

Parameters

  • N: The maximum number of elements to return. Optional. Default: 10.
  • reserved: The number of cells reserved for values. If the number of unique values exceeds reserved, the result becomes approximate. Optional. Default: N * 3.

Arguments

  • column: The column whose value frequency is to be calculated.

Returns

An array of tuples, where each tuple contains a value, its approximate count, and an error value. Array(Tuple(T, UInt64, UInt64)).

Example

SELECT approx_top_k(2)(k)
FROM values('k String, w UInt64', ('y', 1), ('y', 1), ('x', 5), ('y', 1), ('z', 10))

Result:

┌─approx_top_k(2)(k)────┐
│ [('y',3,0),('x',1,0)] │
└───────────────────────┘

approx_top_sum

Returns an array of the approximately most frequent values and their weighted counts from a column. The result is sorted by approximate weighted frequency in descending order.

Syntax

approx_top_sum(N)(column, weight)
approx_top_sum(N, reserved)(column, weight)

Parameters

  • N: The maximum number of elements to return. Optional. Default: 10.
  • reserved: The number of cells reserved for values. If the number of unique values exceeds reserved, the result becomes approximate. Optional. Default: N * 3.

Arguments

  • column: The column whose value frequency is to be calculated.
  • weight: A UInt64 value representing the weight for each column entry.

Returns

An array of tuples, where each tuple contains a value, its approximate weighted count, and an error value. Array(Tuple(T, UInt64, UInt64)).

Example

SELECT approx_top_sum(2)(k, w)
FROM values('k String, w UInt64', ('y', 1), ('y', 1), ('x', 5), ('y', 1), ('z', 10))

Result:

┌─approx_top_sum(2)(k, w)─┐
│ [('z',10,0),('x',5,0)]  │
└─────────────────────────┘

argMax

Retrieves the arg value associated with the maximum val value within a group. If multiple val values are tied for the maximum, the returned arg is non-deterministic. NULL values in arg or val are skipped.

Syntax

argMax(arg, val)

Arguments

  • arg: The argument whose value is to be returned.
  • val: The value used to determine the maximum.

Returns

The arg value corresponding to the maximum val. The return type matches the arg type.

Example

WITH salary_data AS (
  SELECT user, salary FROM values('user String, salary UInt32', ('director', 5000), ('manager', 3000), ('worker', 1000))
)
SELECT argMax(user, salary) FROM salary_data

Result:

┌─argMax(user, salary)─┐
│ director             │
└──────────────────────┘

argMin

Retrieves the arg value associated with the minimum val value within a group. If multiple val values are tied for the minimum, the returned arg is non-deterministic. NULL values in arg or val are skipped.

Syntax

argMin(arg, val)

Arguments

  • arg: The argument whose value is to be returned.
  • val: The value used to determine the minimum.

Returns

The arg value corresponding to the minimum val. The return type matches the arg type.

Example

WITH salary_data AS (
  SELECT user, salary FROM values('user String, salary UInt32', ('director', 5000), ('manager', 3000), ('worker', 1000))
)
SELECT argMin(user, salary) FROM salary_data

Result:

┌─argMin(user, salary)─┐
│ worker               │
└──────────────────────┘

avg

Calculates the arithmetic mean (average) of a set of numeric values.

Syntax

avg(x)

Arguments

  • x: A numeric expression (Integer, Float, or Decimal).

Returns

The arithmetic mean as a Float64. Returns NaN if the input set is empty.

Example

SELECT avg(x) FROM values('x Int8', 0, 1, 2, 3, 4, 5)

Result:

┌─avg(x)─┐
│    2.5 │
└────────┘

avgWeighted

Calculates the weighted arithmetic mean of a set of values, where each value contributes according to its specified weight.

Syntax

avgWeighted(x, weight)

Arguments

  • x: The numeric values (Integer or Float).
  • weight: The numeric weights corresponding to each value (Integer or Float).

Returns

The weighted arithmetic mean as a Float64. Returns NaN if all weights are zero or the input set is empty.

Example

SELECT avgWeighted(x, w)
FROM values('x Int8, w Int8', (4, 1), (1, 0), (10, 2))

Result:

┌─avgWeighted(x, weight)─┐
│                      8 │
└────────────────────────┘

boundingRatio

Calculates the slope of the line connecting the leftmost and rightmost points in a series of (x, y) coordinates.

Syntax

boundingRatio(x, y)

Arguments

  • x: The x-coordinate values.
  • y: The y-coordinate values.

Returns

The slope as a Float64.

Example

SELECT boundingRatio(number, number * 1.5)
FROM numbers(10)

Result:

┌─boundingRatio(number, multiply(number, 1.5))─┐
│                                          1.5 │
└──────────────────────────────────────────────┘

categoricalInformationValue

Calculates the Information Value (IV) for categorical features against a binary target variable. This metric helps assess the predictive power of categorical variables.

Syntax

categoricalInformationValue(category1, category2, ..., tag)

Arguments

  • category1, category2, ...: One or more categorical columns to analyze.
  • tag: The binary target variable (ee.g., 0 or 1).

Returns

A Float64 representing the Information Value.

Example

SELECT categoricalInformationValue(category, target)
FROM values('category String, target UInt8', ('A', 0), ('A', 1), ('B', 0), ('B', 0), ('C', 1))

Result:

┌─categoricalInformationValue(category, target)─┐
│                                           0.0 │
└───────────────────────────────────────────────┘

contingency

Calculates the contingency coefficient, a measure of association between two categorical variables. The coefficient ranges from 0 to 1, where higher values indicate a stronger association.

Syntax

contingency(column1, column2)

Arguments

  • column1: The first categorical column.
  • column2: The second categorical column.

Returns

A Float64 value between 0 and 1.

Example

SELECT
    contingency(a ,b)
FROM
    (
        SELECT
            number % 10 AS a,
            number % 4 AS b
        FROM
            numbers(150)
    )

Result:

┌─contingency(a, b)─┐
│ 0.05812725261759165 │
└───────────────────┘

corr

Calculates the Pearson correlation coefficient between two numeric variables. This function uses a numerically unstable algorithm.

Syntax

corr(x, y)

Arguments

  • x: The first numeric variable (Integer, Float, or Decimal).
  • y: The second numeric variable (Integer, Float, or Decimal).

Returns

The Pearson correlation coefficient as a Float64.

Example

SELECT corr(c2, c3)
FROM values((1, 5.6, -4.4),(2, -9.6, 3),(3, -1.3, -4),(4, 5.3, 9.7),(5, 4.4, 0.037),(6, -8.6, -7.8),(7, 5.1, 9.3),(8, 7.9, -3.6),(9, -8.2, 0.62),(10, -3, 7.3))

Result:

┌─corr(c2, c3)──────────┐
│    0.1730265755453256 │
└───────────────────────┘

corrMatrix

Computes the correlation matrix for a set of numeric variables.

Syntax

corrMatrix(x[, ...])

Arguments

  • x: One or more numeric variables (Integer, Float, or Decimal).

Returns

A nested array representing the correlation matrix. Array(Array(Float64)).

Example

SELECT
  arrayMap(x -> round(x, 3), arrayJoin(corrMatrix(c1, c2, c3, c4))) AS corrMatrix
FROM values ((1, 5.6, -4.4, 2.6), (2, -9.6, 3, 3.3), (3, -1.3, -4, 1.2), (4, 5.3, 9.7, 2.3), (5, 4.4, 0.037, 1.222), (6, -8.6, -7.8, 2.1233), (7, 5.1, 9.3, 8.1222), (8, 7.9, -3.6, 9.837), (9, -8.2, 0.62, 8.43555), (10, -3, 7.3, 6.762))

Result:

┌─corrMatrix─────────────┐
│ [1,-0.096,0.243,0.746] │
│ [-0.096,1,0.173,0.106] │
│ [0.243,0.173,1,0.258]  │
│ [0.746,0.106,0.258,1]  │
└────────────────────────┘

corrStable

Calculates the Pearson correlation coefficient between two numeric variables using a numerically stable algorithm. This function is generally slower but provides more accurate results than corr.

Syntax

corrStable(x, y)

Arguments

  • x: The first numeric variable (Integer, Float, or Decimal).
  • y: The second numeric variable (Integer, Float, or Decimal).

Returns

The Pearson correlation coefficient as a Float64.

Example

SELECT corrStable(c2, c3)
FROM values((1, 5.6, -4.4),(2, -9.6, 3),(3, -1.3, -4),(4, 5.3, 9.7),(5, 4.4, 0.037),(6, -8.6, -7.8),(7, 5.1, 9.3),(8, 7.9, -3.6),(9, -8.2, 0.62),(10, -3, 7.3))

Result:

┌─corrStable(c2, c3)─────┐
│    0.17302657554532558 │
└────────────────────────┘

count

Counts the number of rows or the number of non-NULL values in a specified column.

Syntax

count()
count(expr)

Arguments

  • (No arguments): Counts all rows.
  • expr: An expression or column name. Counts rows where expr is not NULL.

Returns

The count as a UInt64.

Example

SELECT count(), count(nullable_col)
FROM values('nullable_col Nullable(String)', 'A', 'B', NULL, 'C', NULL)

Result:

┌─count()─┬─count(nullable_col)─┐
│       5 │                   3 │
└─────────┴─────────────────────┘

covarPop

Calculates the population covariance between two numeric variables. This function uses a numerically unstable algorithm.

Syntax

covarPop(x, y)

Arguments

  • x: The first numeric variable (Integer, Float, or Decimal).
  • y: The second numeric variable (Integer, Float, or Decimal).

Returns

The population covariance as a Float64.

Example

SELECT covarPop(c2, c3)
FROM values ((1, 5.6, -4.4),(2, -9.6, 3),(3, -1.3, -4),(4, 5.3, 9.7),(5, 4.4, 0.037),(6, -8.6, -7.8),(7, 5.1, 9.3),(8, 7.9, -3.6),(9, -8.2, 0.62),(10, -3, 7.3))

Result:

┌─covarPop(c2, c3)───┐
│           6.485648 │
└────────────────────┘

covarPopMatrix

Computes the population covariance matrix for a set of numeric variables.

Syntax

covarPopMatrix(x[, ...])

Arguments

  • x: One or more numeric variables (Integer, Float, or Decimal).

Returns

A nested array representing the population covariance matrix. Array(Array(Float64)).

Example

SELECT arrayMap(x -> round(x, 3), arrayJoin(covarPopMatrix(c1, c2, c3, c4))) AS covarPopMatrix
FROM values ((1, 5.6, -4.4, 2.6), (2, -9.6, 3, 3.3), (3, -1.3, -4, 1.2), (4, 5.3, 9.7, 2.3), (5, 4.4, 0.037, 1.222), (6, -8.6, -7.8, 2.1233), (7, 5.1, 9.3, 8.1222), (8, 7.9, -3.6, 9.837), (9, -8.2, 0.62, 8.43555), (10, -3, 7.3, 6.762))

Result:

┌─covarPopMatrix────────────┐
│ [8.25,-1.76,4.08,6.748]   │
│ [-1.76,41.07,6.486,2.132] │
│ [4.08,6.486,34.21,4.755]  │
│ [6.748,2.132,4.755,9.93]  │
└───────────────────────────┘

covarPopStable

Calculates the population covariance between two numeric variables using a numerically stable algorithm. This function is generally slower but provides more accurate results than covarPop.

Syntax

covarPopStable(x, y)

Arguments

  • x: The first numeric variable (Integer, Float, or Decimal).
  • y: The second numeric variable (Integer, Float, or Decimal).

Returns

The population covariance as a Float64.

Example

SELECT covarPopStable(c2, c3)
FROM values ((1, 5.6,-4.4),(2, -9.6,3),(3, -1.3,-4),(4, 5.3,9.7),(5, 4.4,0.037),(6, -8.6,-7.8),(7, 5.1,9.3),(8, 7.9,-3.6),(9, -8.2,0.62),(10, -3,7.3))

Result:

┌─covarPopStable(c2, c3)─┐
│               6.485648 │
└────────────────────────┘

covarSamp

Calculates the sample covariance between two numeric variables. This function uses a numerically unstable algorithm.

Syntax

covarSamp(x, y)

Arguments

  • x: The first numeric variable (Integer, Float, or Decimal).
  • y: The second numeric variable (Integer, Float, or Decimal).

Returns

The sample covariance as a Float64. Returns NaN if the sample size is 1 or less.

Example

SELECT covarSamp(c2, c3)
FROM values ((1, 5.6,-4.4),(2, -9.6,3),(3, -1.3,-4),(4, 5.3,9.7),(5, 4.4,0.037),(6, -8.6,-7.8),(7, 5.1,9.3),(8, 7.9,-3.6),(9, -8.2,0.62),(10, -3,7.3))

Result:

┌─covarSamp(c2, c3)───┐
│   7.206275555555556 │
└─────────────────────┘

covarSampMatrix

Computes the sample covariance matrix for a set of numeric variables.

Syntax

covarSampMatrix(x[, ...])

Arguments

  • x: One or more numeric variables (Integer, Float, or Decimal).

Returns

A nested array representing the sample covariance matrix. Array(Array(Float64)).

Example

SELECT arrayMap(x -> round(x, 3), arrayJoin(covarSampMatrix(c1, c2, c3, c4))) AS covarSampMatrix
FROM values ((1, 5.6, -4.4, 2.6), (2, -9.6, 3, 3.3), (3, -1.3, -4, 1.2), (4, 5.3, 9.7, 2.3), (5, 4.4, 0.037, 1.222), (6, -8.6, -7.8, 2.1233), (7, 5.1, 9.3, 8.1222), (8, 7.9, -3.6, 9.837), (9, -8.2, 0.62, 8.43555), (10, -3, 7.3, 6.762))

Result:

┌─covarSampMatrix─────────────┐
│ [9.167,-1.956,4.534,7.498]  │
│ [-1.956,45.634,7.206,2.369] │
│ [4.534,7.206,38.011,5.283]  │
│ [7.498,2.369,5.283,11.034]  │
└─────────────────────────────┘

covarSampStable

Calculates the sample covariance between two numeric variables using a numerically stable algorithm. This function is generally slower but provides more accurate results than covarSamp.

Syntax

covarSampStable(x, y)

Arguments

  • x: The first numeric variable (Integer, Float, or Decimal).
  • y: The second numeric variable (Integer, Float, or Decimal).

Returns

The sample covariance as a Float64. Returns inf if the sample size is 1 or less.

Example

SELECT covarSampStable(c2, c3)
FROM values ((1, 5.6,-4.4),(2, -9.6,3),(3, -1.3,-4),(4, 5.3,9.7),(5, 4.4,0.037),(6, -8.6,-7.8),(7, 5.1,9.3),(8, 7.9,-3.6),(9, -8.2,0.62),(10, -3,7.3))

Result:

┌─covarSampStable(x_value, y_value)─┐
│                 7.206275555555556 │
└───────────────────────────────────┘

cramersV

Calculates Cramer's V, a measure of association between two nominal variables. The value ranges from 0 (no association) to 1 (complete association).

Syntax

cramersV(column1, column2)

Arguments

  • column1: The first categorical column.
  • column2: The second categorical column.

Returns

A Float64 value between 0 and 1.

Example

SELECT
    cramersV(a, b)
FROM
    (
        SELECT
            number % 10 AS a,
            number % 5 AS b
        FROM
            numbers(150)
    )

Result:

┌─────cramersV(a, b)─┐
│ 0.8944271909999159 │
└────────────────────┘

cramersVBiasCorrected

Calculates a bias-corrected version of Cramer's V, a measure of association between two nominal variables. This version aims to provide a more accurate estimate of the association, especially for smaller sample sizes.

Syntax

cramersVBiasCorrected(column1, column2)

Arguments

  • column1: The first categorical column.
  • column2: The second categorical column.

Returns

A Float64 value between 0 and 1.

Example

SELECT
    cramersV(a, b),
    cramersVBiasCorrected(a ,b)
FROM
    (
        SELECT
            number % 10 AS a,
            number % 4 AS b
        FROM
            numbers(150)
    )

Result:

┌──────cramersV(a, b)─┬─cramersVBiasCorrected(a, b)─┐
│ 0.41171788506213564 │         0.33369281784141364 │
└─────────────────────┴─────────────────────────────┘

deltaSum

Calculates the sum of positive differences between consecutive values in a sequence. Negative differences are ignored. This function assumes the input data is already sorted.

Syntax

deltaSum(value)

Arguments

  • value: A numeric expression (Integer or Float).

Returns

The sum of positive differences, with a type matching the input value.

Example

SELECT deltaSum(arrayJoin([1, 2, 3, 0, 3, 4, 2, 3]))

Result:

┌─deltaSum(arrayJoin([1, 2, 3, 0, 3, 4, 2, 3]))─┐
│                                             7 │
└───────────────────────────────────────────────┘

deltaSumTimestamp

Calculates the sum of positive differences between consecutive values, ordered by a specified timestamp. This function is useful for materialized views where rows might not be physically ordered by the value itself but by a timestamp.

Syntax

deltaSumTimestamp(value, timestamp)

Arguments

  • value: The numeric value (Integer, Float, Date, or DateTime).
  • timestamp: The timestamp or ordering key (Integer, Float, Date, or DateTime).

Returns

The accumulated differences, with a type matching the value argument.

Example

SELECT deltaSumTimestamp(value, timestamp)
FROM (SELECT number AS timestamp, [0, 4, 8, 3, 0, 0, 0, 1, 3, 5][number] AS value FROM numbers(1, 10))

Result:

┌─deltaSumTimestamp(value, timestamp)─┐
│                                  13 │
└─────────────────────────────────────┘

entropy

Calculates the Shannon entropy of a column of values, which measures the average amount of information or uncertainty in the data.

Syntax

entropy(val)

Arguments

  • val: A column of values of any data type.

Returns

The Shannon entropy as a Float64.

Example

SELECT entropy(c1), entropy(c2) from values((1, 'A'), (1, 'A'), (1,'A'), (1,'A'), (2,'B'), (2,'B'), (2,'C'), (2,'D'))

Result:

┌─entropy(c1)─┬─entropy(c2)─┐
│           1 │        1.75 │
└─────────────┴─────────────┘

exponentialMovingAverage

Calculates the exponential moving average of a series of values, giving more weight to recent observations. The half-life parameter x determines the decay rate of older values.

Syntax

exponentialMovingAverage(x)(value, timeunit)

Parameters

  • x: The half-life period, a numeric value (Integer, Float, or Decimal).

Arguments

  • value: The numeric value (Integer, Float, or Decimal).
  • timeunit: A numeric index representing time units (Integer, Float, or Decimal). This is not a timestamp but an index of the time interval.

Returns

The exponentially smoothed moving average at the latest point in time as a Float64.

Example

WITH data AS (
  SELECT
    number AS timestamp,
    95 + (number % 5) AS temperature
  FROM numbers(20)
)
SELECT exponentialMovingAverage(5)(temperature, timestamp) FROM data

Result:

┌─exponentialMovingAverage(5)(temperature, timestamp)──┐
│                                    92.25779635374204 │
└──────────────────────────────────────────────────────┘

exponentialTimeDecayedAvg

Calculates the exponentially smoothed weighted moving average of a time series at a given point t, where older values have less influence.

Syntax

exponentialTimeDecayedAvg(x)(v, t)

Parameters

  • x: The half-life period, a numeric value (Integer, Float, or Decimal).

Arguments

  • v: The numeric value (Integer, Float, or Decimal).
  • t: The time point (Integer, Float, Decimal, DateTime, or DateTime64).

Returns

The exponentially smoothed weighted moving average at time t as a Float64.

Example

SELECT
    value,
    time,
    round(exp_smooth, 3) AS exp_smooth
FROM
    (
    SELECT
    (number = 0) OR (number >= 25) AS value,
    number AS time,
    exponentialTimeDecayedAvg(10)(value, time) OVER (ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS exp_smooth
    FROM numbers(5)
    )

Result:

┌─value─┬─time─┬─exp_smooth─┐
│     1 │    0 │          1 │
│     0 │    1 │      0.475 │
│     0 │    2 │      0.301 │
│     0 │    3 │      0.214 │
│     0 │    4 │      0.162 │
└───────┴──────┴────────────┘

exponentialTimeDecayedCount

Returns the cumulative exponential decay over a time series at a given point t. This function quantifies the total "presence" of events, with more recent events contributing more.

Syntax

exponentialTimeDecayedCount(x)(t)

Parameters

  • x: The half-life period, a numeric value (Integer, Float, or Decimal).

Arguments

  • t: The time point (Integer, Float, Decimal, DateTime, or DateTime64).

Returns

The cumulative exponential decay at the given time point as a Float64.

Example

SELECT
    value,
    time,
    round(exp_smooth, 3) AS exp_smooth
FROM
(
    SELECT
        (number % 5) = 0 AS value,
        number AS time,
        exponentialTimeDecayedCount(10)(time) OVER (ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS exp_smooth
    FROM numbers(5)
)

Result:

┌─value─┬─time─┬─exp_smooth─┐
│     1 │    0 │          1 │
│     0 │    1 │      1.905 │
│     0 │    2 │      2.724 │
│     0 │    3 │      3.464 │
│     0 │    4 │      4.135 │
└───────┴──────┴────────────┘

exponentialTimeDecayedMax

Returns the maximum of the exponentially smoothed weighted moving average at time t compared to the value at t-1. This helps track the peak influence of recent events.

Syntax

exponentialTimeDecayedMax(x)(value, timeunit)

Parameters

  • x: The half-life period, a numeric value (Integer, Float, or Decimal).

Arguments

  • value: The numeric value (Integer, Float, or Decimal).
  • timeunit: The time point (Integer, Float, Decimal, DateTime, or DateTime64).

Returns

The maximum of the exponentially smoothed weighted moving average at t and t-1 as a Float64.

Example

SELECT
    value,
    time,
    round(exp_smooth, 3) AS exp_smooth
FROM
    (
    SELECT
    (number = 0) OR (number >= 25) AS value,
    number AS time,
    exponentialTimeDecayedMax(10)(value, time) OVER (ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS exp_smooth
    FROM numbers(5)
    )

Result:

┌─value─┬─time─┬─exp_smooth─┐
│     1 │    0 │          1 │
│     0 │    1 │      0.905 │
│     0 │    2 │      0.819 │
│     0 │    3 │      0.741 │
│     0 │    4 │       0.67 │
└───────┴──────┴────────────┘

exponentialTimeDecayedSum

Returns the sum of exponentially smoothed moving average values of a time series at a given point t. This function provides a weighted sum where recent values contribute more.

Syntax

exponentialTimeDecayedSum(x)(v, t)

Parameters

  • x: The half-life period, a numeric value (Integer, Float, or Decimal).

Arguments

  • v: The numeric value (Integer, Float, or Decimal).
  • t: The time point (Integer, Float, Decimal, DateTime, or DateTime64).

Returns

The sum of exponentially smoothed moving average values at the given time point as a Float64.

Example

SELECT
    value,
    time,
    round(exp_smooth, 3) AS exp_smooth
FROM
    (
    SELECT
    (number = 0) OR (number >= 25) AS value,
    number AS time,
    exponentialTimeDecayedSum(10)(value, time) OVER (ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS exp_smooth
    FROM numbers(5)
    )

Result:

┌─value─┬─time─┬─exp_smooth─┐
│     1 │    0 │          1 │
│     0 │    1 │      0.905 │
│     0 │    2 │      0.819 │
│     0 │    3 │      0.741 │
│     0 │    4 │       0.67 │
└───────┴──────┴────────────┘

first_value

Selects the first encountered value of a column. This function is an alias for any but explicitly supports the RESPECT NULLS modifier, making it suitable for window functions or when NULL values need to be considered.

Syntax

first_value(column) [RESPECT NULLS]

Arguments

  • column: The column from which to select the first value.

Returns

The first value encountered. If RESPECT NULLS is used, it returns the first value, even if it's NULL. Otherwise, it skips NULLs. The return type matches the input column's type.

Example

WITH cte AS (SELECT arrayJoin([NULL, 'Amsterdam', 'New York', 'Tokyo', 'Valencia', NULL]) as city)
SELECT first_value(city), first_value(city) RESPECT NULLS FROM cte

Result:

┌─first_value(city)─┬─first_value(city) RESPECT NULLS─┐
│ Amsterdam         │ ᴺᵁᴸᴸ                            │
└───────────────────┴─────────────────────────────────┘

groupArray

Creates an array containing all values of an argument within a group. The order of elements in the resulting array is non-deterministic.

Syntax

groupArray(x)
groupArray(max_size)(x)

Alias: array_agg

Arguments

  • x: The expression or column whose values are to be collected into an array.
  • max_size: An optional UInt64 parameter that limits the size of the resulting array.

Returns

An array containing the collected values. NULL values are excluded from the array. Array(T).

Example

WITH data AS (SELECT number % 3 AS id, toString(number) AS name FROM numbers(5))
SELECT id, groupArray(name) FROM data GROUP BY id ORDER BY id

Result:

┌─id─┬─groupArray(name)─┐
│  0 │ ['0','3']        │
│  1 │ ['1','4']        │
│  2 │ ['2']            │
└────┴──────────────────┘

groupArrayInsertAt

Inserts a value into an array at a specified position. If multiple values are inserted into the same position within a single-threaded query, the first value is used. In multi-threaded queries, the result is non-deterministic.

Syntax

groupArrayInsertAt(default_x, size)(x, pos)

Arguments

  • x: The value to insert.
  • pos: The UInt32 index where x should be inserted (0-based).
  • default_x: An optional expression for the default value to fill empty positions. If omitted, the default value for the data type of x is used.
  • size: An optional UInt32 parameter specifying the fixed length of the resulting array. Requires default_x to be specified.

Returns

An array with the value inserted at the specified position. Array(T).

Example

SELECT groupArrayInsertAt('-', 5)(toString(number), number * 2) FROM numbers(3)

Result:

┌─groupArrayInsertAt('-', 5)(toString(number), multiply(number, 2))─┐
│ ['0','-','1','-','2']                                             │
└───────────────────────────────────────────────────────────────────┘

groupArrayIntersect

Returns an array containing elements that are present in all input arrays within a group (the intersection of arrays).

Syntax

groupArrayIntersect(x)

Arguments

  • x: An array expression or column.

Returns

An array containing only the common elements found in all input arrays. Array(T).

Example

WITH data AS (
  SELECT arrayJoin([
    [1,2,4],
    [1,5,2,8,-1,0],
    [1,5,7,5,8,2]
  ]) AS a
)
SELECT groupArrayIntersect(a) AS intersection FROM data

Result:

┌─intersection─┐
│ [1,2]        │
└──────────────┘

groupArrayLast

Creates an array containing the last encountered values of an argument within a group. The order of elements in the resulting array is non-deterministic unless the data is explicitly ordered.

Syntax

groupArrayLast(max_size)(x)

Arguments

  • max_size: An optional UInt64 parameter that limits the size of the resulting array.
  • x: The expression or column whose values are to be collected into an array.

Returns

An array containing the last collected values. Array(T).

Example

SELECT groupArrayLast(2)(number+1) AS numbers FROM numbers(10)

Result:

┌─numbers─┐
│ [9,10]  │
└─────────┘

groupArrayMovingAvg

Calculates the moving average of numeric input values over a specified window.

Syntax

groupArrayMovingAvg(numbers_for_summing)
groupArrayMovingAvg(window_size)(numbers_for_summing)

Arguments

  • numbers_for_summing: A numeric expression.
  • window_size: An optional UInt64 parameter defining the size of the moving window. If omitted, the window size defaults to the total number of rows.

Returns

An array of the same size and type as the input, where each element is the moving average. Array(T).

Example

SELECT groupArrayMovingAvg(2)(number) FROM numbers(5)

Result:

┌─groupArrayMovingAvg(2)(number)─┐
│ [0,0,1,2,3]                    │
└────────────────────────────────┘

groupArrayMovingSum

Calculates the moving sum of numeric input values over a specified window.

Syntax

groupArrayMovingSum(numbers_for_summing)
groupArrayMovingSum(window_size)(numbers_for_summing)

Arguments

  • numbers_for_summing: A numeric expression.
  • window_size: An optional UInt64 parameter defining the size of the moving window. If omitted, the window size defaults to the total number of rows.

Returns

An array of the same size and type as the input, where each element is the moving sum. Array(T).

Example

SELECT groupArrayMovingSum(2)(number) FROM numbers(5)

Result:

┌─groupArrayMovingSum(2)(number)─┐
│ [0,1,3,5,7]                    │
└────────────────────────────────┘

groupArraySample

Creates an array of randomly sampled argument values. The size of the resulting array is limited to max_size elements.

Syntax

groupArraySample(max_size[, seed])(x)

Arguments

  • max_size: The maximum UInt64 size of the resulting array.
  • seed: An optional UInt64 seed for the random number generator. Default: 123456.
  • x: The expression or column whose values are to be sampled.

Returns

An array of randomly selected x arguments. Array(T).

Example

WITH colors AS (SELECT arrayJoin(['red', 'blue', 'green', 'white', 'orange']) AS color)
SELECT groupArraySample(3)(color) AS newcolors FROM colors

Result:

┌─newcolors──────────────────┐
│ ['white','blue','green']   │
└────────────────────────────┘

groupArraySorted

Returns an array containing the first N items from a group, sorted in ascending order.

Syntax

groupArraySorted(N)(column)

Arguments

  • N: The UInt64 number of elements to return.
  • column: The column whose values are to be collected and sorted.

Returns

An array of the first N sorted items. Array(T).

Example

SELECT groupArraySorted(5)(number) FROM numbers(10)

Result:

┌─groupArraySorted(5)(number)─┐
│ [0,1,2,3,4]                 │
└─────────────────────────────┘

groupBitAnd

Applies a bit-wise AND operation to all numeric values within a group.

Syntax

groupBitAnd(expr)

Arguments

  • expr: A numeric expression (UInt* or Int*).

Returns

The result of the bit-wise AND operation, with a type matching expr.

Example

WITH data AS (SELECT arrayJoin([44, 28, 13, 85]) AS num)
SELECT groupBitAnd(num) FROM data

Result:

┌─groupBitAnd(num)─┐
│                4 │
└──────────────────┘

groupBitmap

Creates a bitmap from a column of unsigned integers and returns its cardinality (number of unique elements).

Syntax

groupBitmap(expr)

Arguments

  • expr: An unsigned integer expression (UInt*).

Returns

The cardinality of the bitmap as a UInt64.

Example

WITH data AS (SELECT arrayJoin([1, 1, 2, 3]) AS UserID)
SELECT groupBitmap(UserID) AS num FROM data

Result:

┌─num─┐
│   3 │
└─────┘

groupBitmapAnd

Calculates the cardinality of the bit-wise AND operation across multiple bitmap states.

Syntax

groupBitmapAnd(expr)

Arguments

  • expr: An expression that results in an AggregateFunction(groupBitmap, UInt*) state.

Returns

The cardinality of the resulting bitmap as a UInt64.

Example

WITH data AS (
  SELECT groupBitmapState(number) AS bm FROM numbers(5) WHERE number IN (0,1,2)
  UNION ALL
  SELECT groupBitmapState(number) AS bm FROM numbers(5) WHERE number IN (1,2,3)
)
SELECT groupBitmapAnd(bm) FROM data

Result:

┌─groupBitmapAnd(bm)─┐
│                  2 │
└────────────────────┘

groupBitmapOr

Calculates the cardinality of the bit-wise OR operation across multiple bitmap states. This is equivalent to groupBitmapMerge.

Syntax

groupBitmapOr(expr)

Arguments

  • expr: An expression that results in an AggregateFunction(groupBitmap, UInt*) state.

Returns

The cardinality of the resulting bitmap as a UInt64.

Example

WITH data AS (
  SELECT groupBitmapState(number) AS bm FROM numbers(5) WHERE number IN (0,1,2)
  UNION ALL
  SELECT groupBitmapState(number) AS bm FROM numbers(5) WHERE number IN (1,2,3)
)
SELECT groupBitmapOr(bm) FROM data

Result:

┌─groupBitmapOr(bm)─┐
│                 4 │
└───────────────────┘

groupBitmapXor

Calculates the cardinality of the bit-wise XOR operation across multiple bitmap states.

Syntax

groupBitmapXor(expr)

Arguments

  • expr: An expression that results in an AggregateFunction(groupBitmap, UInt*) state.

Returns

The cardinality of the resulting bitmap as a UInt64.

Example

WITH data AS (
  SELECT groupBitmapState(number) AS bm FROM numbers(5) WHERE number IN (0,1,2)
  UNION ALL
  SELECT groupBitmapState(number) AS bm FROM numbers(5) WHERE number IN (1,2,3)
)
SELECT groupBitmapXor(bm) FROM data

Result:

┌─groupBitmapXor(bm)─┐
│                  2 │
└────────────────────┘

groupBitOr

Applies a bit-wise OR operation to all numeric values within a group.

Syntax

groupBitOr(expr)

Arguments

  • expr: A numeric expression (UInt* or Int*).

Returns

The result of the bit-wise OR operation, with a type matching expr.

Example

WITH data AS (SELECT arrayJoin([44, 28, 13, 85]) AS num)
SELECT groupBitOr(num) FROM data

Result:

┌─groupBitOr(num)─┐
│             125 │
└─────────────────┘

groupBitXor

Applies a bit-wise XOR operation to all numeric values within a group.

Syntax

groupBitXor(expr)

Arguments

  • expr: A numeric expression (UInt* or Int*).

Returns

The result of the bit-wise XOR operation, with a type matching expr.

Example

WITH data AS (SELECT arrayJoin([44, 28, 13, 85]) AS num)
SELECT groupBitXor(num) FROM data

Result:

┌─groupBitXor(num)─┐
│              104 │
└──────────────────┘

groupConcat

Concatenates strings from a group into a single string, with optional delimiter and limit.

Syntax

groupConcat[(delimiter [, limit])](expression)

Arguments

  • expression: The string expression or column to concatenate.
  • delimiter: An optional String to separate concatenated values. Defaults to an empty string.
  • limit: An optional UInt64 to specify the maximum number of elements to concatenate.

Returns

A String containing the concatenated values. If the group is empty or contains only NULLs, it returns NULL (if the function doesn't specify NULL handling).

Example

WITH Employees AS (SELECT arrayJoin(['John', 'Jane', 'Bob']) AS Name)
SELECT groupConcat(', ', 2)(Name) FROM Employees

Result:

┌─groupConcat(', ', 2)(Name)─┐
│ John, Jane                 │
└────────────────────────────┘

groupUniqArray

Creates an array containing unique values of an argument within a group. Memory consumption is similar to uniqExact.

Syntax

groupUniqArray(x)
groupUniqArray(max_size)(x)

Arguments

  • x: The expression or column whose unique values are to be collected into an array.
  • max_size: An optional UInt64 parameter that limits the size of the resulting array.

Returns

An array containing the unique collected values. Array(T).

Example

WITH data AS (SELECT arrayJoin([1, 2, 2, 3, 1, 4]) AS num)
SELECT groupUniqArray(num) FROM data

Result:

┌─groupUniqArray(num)─┐
│ [1,2,3,4]           │
└─────────────────────┘

intervalLengthSum

Calculates the total length of the union of all intervals (segments on a numeric axis) defined by start and end points.

Syntax

intervalLengthSum(start, end)

Arguments

  • start: The starting value of the interval (numeric, Date, or DateTime).
  • end: The ending value of the interval (numeric, Date, or DateTime). Arguments must be of the same data type.

Returns

The total length of the union of intervals. UInt64 for integer/date types, Float64 for float types.

Example

WITH fl_interval AS (
  SELECT 'a' AS id, arrayJoin([(1.1, 2.9), (2.5, 3.2), (4.0, 5.0)]) AS interval_pair
)
SELECT id, intervalLengthSum(interval_pair.1, interval_pair.2) AS total_length
FROM fl_interval GROUP BY id

Result:

┌─id─┬─total_length─┐
│ a  │          3.1 │
└────┴──────────────┘

kolmogorovSmirnovTest

Applies the Kolmogorov-Smirnov test to compare two samples and determine if they come from the same continuous probability distribution.

Syntax

kolmogorovSmirnovTest([alternative, computation_method])(sample_data, sample_index)

Parameters

  • alternative: Optional String specifying the alternative hypothesis ('two-sided', 'greater', 'less'). Default: 'two-sided'.
  • computation_method: Optional String specifying the p-value computation method ('exact', 'asymp', 'auto'). Default: 'auto'.

Arguments

  • sample_data: The numeric values from both samples (Integer, Float, or Decimal).
  • sample_index: An Integer (0 or 1) indicating which sample the sample_data belongs to.

Returns

A tuple containing the calculated statistic and the p-value. Tuple(Float64, Float64).

Example

SELECT kolmogorovSmirnovTest('less', 'exact')(value, num)
FROM
(
    SELECT
        randNormal(0, 10) AS value,
        0 AS num
    FROM numbers(100)
    UNION ALL
    SELECT
        randNormal(0, 10) AS value,
        1 AS num
    FROM numbers(100)
)

Result:

┌─kolmogorovSmirnovTest('less', 'exact')(value, num)─┐
│ (0.009999999999999998,0.37528595205132287)         │
└────────────────────────────────────────────────────┘

kurtPop

Computes the population kurtosis of a sequence of numbers, which measures the "tailedness" of the probability distribution.

Syntax

kurtPop(expr)

Arguments

  • expr: A numeric expression.

Returns

The population kurtosis as a Float64.

Example

SELECT kurtPop(number) FROM numbers(10)

Result:

┌─kurtPop(number)─┐
│          -1.224 │
└─────────────────┘

kurtSamp

Computes the sample kurtosis of a sequence of numbers, providing an unbiased estimate of the kurtosis of a random variable from a sample.

Syntax

kurtSamp(expr)

Arguments

  • expr: A numeric expression.

Returns

The sample kurtosis as a Float64. Returns NaN if the sample size is 1 or less.

Example

SELECT kurtSamp(number) FROM numbers(10)

Result:

┌─kurtSamp(number)─┐
│ -1.224242424242424 │
└──────────────────┘

largestTriangleThreeBuckets

Applies the Largest-Triangle-Three-Buckets (LTTB) algorithm to downsample time series data for visualization. It selects N points that best represent the original series, preserving visual trends.

Syntax

largestTriangleThreeBuckets(n)(x, y)

Alias: lttb

Parameters

  • n: The UInt64 number of points to return in the downsampled series.

Arguments

  • x: The x-coordinate (Integer, Float, Decimal, Date, Date32, DateTime, or DateTime64).
  • y: The y-coordinate (Integer, Float, Decimal, Date, Date32, DateTime, or DateTime64). NaN values in the series are ignored.

Returns

An array of tuples, where each tuple contains the x and y coordinates of the selected points. Array(Tuple(T_x, T_y)).

Example

WITH data AS (
  SELECT arrayJoin([
    (1.0, 10.0), (2.0, 20.0), (3.0, 15.0), (8.0, 60.0),
    (9.0, 55.0), (10.0, 70.0), (4.0, 30.0), (5.0, 40.0),
    (6.0, 35.0), (7.0, 50.0)
  ]) AS point
)
SELECT largestTriangleThreeBuckets(4)(point.1, point.2) FROM data

Result:

┌─largestTriangleThreeBuckets(4)(point.1, point.2)─┐
│ [(1,10),(3,15),(9,55),(10,70)]                   │
└──────────────────────────────────────────────────┘

last_value

Selects the last encountered value of a column. This function is an alias for anyLast but explicitly supports the RESPECT NULLS modifier, making it suitable for window functions or when NULL values need to be considered.

Syntax

last_value(column) [RESPECT NULLS]

Arguments

  • column: The column from which to select the last value.

Returns

The last value encountered. If RESPECT NULLS is used, it returns the last value, even if it's NULL. Otherwise, it skips NULLs. The return type matches the input column's type.

Example

WITH cte AS (SELECT arrayJoin([NULL, 'Amsterdam', 'New York', 'Tokyo', 'Valencia', NULL]) as city)
SELECT last_value(city), last_value(city) RESPECT NULLS FROM cte

Result:

┌─last_value(city)─┬─last_value(city) RESPECT NULLS─┐
│ Valencia         │ ᴺᵁᴸᴸ                           │
└──────────────────┴────────────────────────────────┘

mannWhitneyUTest

Applies the Mann-Whitney U-test (also known as Wilcoxon rank-sum test) to compare two independent samples and determine if they come from the same distribution. This non-parametric test does not assume normal distribution.

Syntax

mannWhitneyUTest[(alternative[, continuity_correction])](sample_data, sample_index)

Parameters

  • alternative: Optional String specifying the alternative hypothesis ('two-sided', 'greater', 'less'). Default: 'two-sided'.
  • continuity_correction: Optional UInt64 (0 or 1) to apply continuity correction in the normal approximation for the p-value. Default: 1.

Arguments

  • sample_data: The numeric values from both samples (Integer, Float, or Decimal).
  • sample_index: An Integer (0 or 1) indicating which sample the sample_data belongs to.

Returns

A tuple containing the calculated U-statistic and the p-value. Tuple(Float64, Float64).

Example

WITH mww_ttest AS (
  SELECT arrayJoin([(10,0), (11,0), (12,0), (1,1), (2,1), (3,1)]) AS data_pair
)
SELECT mannWhitneyUTest('greater')(data_pair.1, data_pair.2) FROM mww_ttest

Result:

┌─mannWhitneyUTest('greater')(data_pair.1, data_pair.2)─┐
│ (9,0.04042779918503192)                               │
└───────────────────────────────────────────────────────┘

max

Calculates the maximum value within a group.

Syntax

max(expr)

Arguments

  • expr: Any comparable expression.

Returns

The maximum value. The return type matches the expr type.

Example

SELECT max(number) FROM numbers(10)

Result:

┌─max(number)─┐
│           9 │
└─────────────┘

maxIntersections

Calculates the maximum number of times that a group of intervals (defined by start and end points) overlap each other.

Syntax

maxIntersections(start_column, end_column)

Arguments

  • start_column: A numeric column representing the start of each interval. NULL or 0 values are skipped.
  • end_column: A numeric column representing the end of each interval. NULL or 0 values are skipped.

Returns

The maximum number of intersected intervals as a UInt64.

Example

WITH intervals AS (
  SELECT arrayJoin([(1, 5), (3, 7), (6, 10), (2, 4)]) AS interval_pair
)
SELECT maxIntersections(interval_pair.1, interval_pair.2) FROM intervals

Result:

┌─maxIntersections(interval_pair.1, interval_pair.2)─┐
│                                                  3 │
└────────────────────────────────────────────────────┘

maxIntersectionsPosition

Calculates the starting positions where the maximum number of interval overlaps occur.

Syntax

maxIntersectionsPosition(start_column, end_column)

Arguments

  • start_column: A numeric column representing the start of each interval. NULL or 0 values are skipped.
  • end_column: A numeric column representing the end of each interval. NULL or 0 values are skipped.

Returns

An array of starting positions where the maximum number of intersections occur. Array(T).

Example

WITH intervals AS (
  SELECT arrayJoin([(1, 5), (3, 7), (6, 10), (2, 4)]) AS interval_pair
)
SELECT maxIntersectionsPosition(interval_pair.1, interval_pair.2) FROM intervals

Result:

┌─maxIntersectionsPosition(interval_pair.1, interval_pair.2)─┐
│ [3]                                                        │
└────────────────────────────────────────────────────────────┘

maxMap

Calculates the maximum value for each key across multiple maps within a group. It returns a tuple of two arrays: sorted keys and their corresponding maximum values.

Syntax

maxMap(key, value)
maxMap(Tuple(key, value))

Alias: maxMappedArrays

Arguments

  • key: An array of keys.
  • value: An array of values. The number of elements in key and value arrays must be the same for each row.

Returns

A tuple containing two arrays: sorted keys and their maximum values. Tuple(Array(T_key), Array(T_value)).

Example

SELECT maxMap(a, b)
FROM values('a Array(String), b Array(Int64)', (['x', 'y'], [2, 2]), (['y', 'z'], [3, 1]))

Result:

┌─maxMap(a, b)───────────┐
│ [['x','y','z'],[2,3,1]]│
└────────────────────────┘

meanZTest

Applies a Z-test for means to compare two populations, assuming normal distribution and known population variances. It determines if the means of the populations are equal.

Syntax

meanZTest(population_variance_x, population_variance_y, confidence_level)(sample_data, sample_index)

Parameters

  • population_variance_x: The known variance of the first population (Float).
  • population_variance_y: The known variance of the second population (Float).
  • confidence_level: The Float confidence level for calculating confidence intervals.

Arguments

  • sample_data: The numeric values from both samples (Integer, Float, or Decimal).
  • sample_index: An Integer (0 or 1) indicating which sample the sample_data belongs to.

Returns

A tuple containing the calculated t-statistic, p-value, and the lower and upper bounds of the confidence interval. Tuple(Float64, Float64, Float64, Float64).

Example

WITH mean_ztest AS (
  SELECT arrayJoin([(20.3,0), (21.9,0), (22.1,0), (18.9,1), (19.0,1), (20.3,1)]) AS data_pair
)
SELECT meanZTest(0.7, 0.45, 0.95)(data_pair.1, data_pair.2) FROM mean_ztest

Result:

┌─meanZTest(0.7, 0.45, 0.95)(data_pair.1, data_pair.2)────────────────────────────┐
│ (3.2841296025548123,0.0010229786769086013,0.8198428246768334,3.2468238419898365) │
└──────────────────────────────────────────────────────────────────────────────────┘

median

Calculates the median of a numeric data sample. This function is an alias for quantile.

Syntax

median(expr)

Aliases: medianDeterministic, medianExact, medianExactWeighted, medianTiming, medianTimingWeighted, medianTDigest, medianTDigestWeighted, medianBFloat16, medianDD. These are aliases for their respective quantile* functions.

Arguments

  • expr: A numeric expression (Integer, Float, Decimal, Date, or DateTime).

Returns

The approximate median of the data. The return type depends on the input expr type (Float64 for numeric, Date for Date, DateTime for DateTime).

Example

SELECT median(number) FROM numbers(5)

Result:

┌─median(number)─┐
│            2.5 │
└────────────────┘

min

Calculates the minimum value within a group.

Syntax

min(expr)

Arguments

  • expr: Any comparable expression.

Returns

The minimum value. The return type matches the expr type.

Example

SELECT min(number) FROM numbers(10)

Result:

┌─min(number)─┐
│           0 │
└─────────────┘

minMap

Calculates the minimum value for each key across multiple maps within a group. It returns a tuple of two arrays: sorted keys and their corresponding minimum values.

Syntax

minMap(key, value)
minMap(Tuple(key, value))

Alias: minMappedArrays

Arguments

  • key: An array of keys.
  • value: An array of values. The number of elements in key and value arrays must be the same for each row.

Returns

A tuple containing two arrays: sorted keys and their minimum values. Tuple(Array(T_key), Array(T_value)).

Example

SELECT minMap(a, b)
FROM values('a Array(Int32), b Array(Int64)', ([1, 2], [2, 2]), ([2, 3], [1, 1]))

Result:

┌─minMap(a, b)──────┐
│ ([1,2,3],[2,1,1]) │
└───────────────────┘

quantile

Computes an approximate quantile of a numeric data sequence using reservoir sampling. The result is non-deterministic.

Syntax

quantile(level)(expr)

Alias: median

Arguments

  • level: An optional Float64 parameter (0 to 1) specifying the quantile level. Default: 0.5 (median).
  • expr: A numeric expression (Integer, Float, Decimal, Date, or DateTime).

Returns

The approximate quantile of the specified level. Float64 for numeric input, Date for Date, DateTime for DateTime. Returns NaN for an empty numeric sequence.

Example

SELECT quantile(number) FROM numbers(5)

Result:

┌─quantile(number)─┐
│            2.5 │
└────────────────┘

quantileBFloat16

Computes an approximate quantile of a numeric data sequence, optimized for bfloat16 numbers. It offers fast estimation with a relative error of no more than 0.390625%.

Syntax

quantileBFloat16[(level)](expr)

Alias: medianBFloat16

Arguments

  • level: An optional Float64 parameter (0 to 1) specifying the quantile level. Default: 0.5.
  • expr: A numeric expression (Integer or Float).

Returns

The approximate quantile of the specified level as a Float64.

Example

SELECT quantileBFloat16(0.75)(number) FROM numbers(5)

Result:

┌─quantileBFloat16(0.75)(number)─┐
│                              3 │
└────────────────────────────────┘

quantileBFloat16Weighted

Computes an approximate quantile of a numeric data sequence, considering the weight of each element, optimized for bfloat16 numbers.

Syntax

quantileBFloat16Weighted[(level)](expr, weight)

Arguments

  • level: An optional Float64 parameter (0 to 1) specifying the quantile level. Default: 0.5.
  • expr: A numeric expression (Integer or Float).
  • weight: A numeric expression representing the weight of each expr value.

Returns

The approximate quantile of the specified level as a Float64.

Example

SELECT quantileBFloat16Weighted(0.75)(number, 1) FROM numbers(5)

Result:

┌─quantileBFloat16Weighted(0.75)(number, 1)─┐
│                                         3 │
└───────────────────────────────────────────┘

quantileDD

Computes an approximate quantile of a numeric data sequence with relative-error guarantees. The accuracy can be controlled by a parameter.

Syntax

quantileDD(relative_accuracy, [level])(expr)

Alias: medianDD

Parameters

  • relative_accuracy: A Float64 (0 to 1) specifying the desired relative accuracy. Recommended: 0.001 or higher.
  • level: An optional Float64 parameter (0 to 1) specifying the quantile level. Default: 0.5.

Arguments

  • expr: A numeric expression (Integer or Float).

Returns

The approximate quantile of the specified level as a Float64.

Example

SELECT quantileDD(0.01, 0.75)(number) FROM numbers(5)

Result:

┌─quantileDD(0.01, 0.75)(number)─┐
│                            3.0 │
└────────────────────────────────┘

quantileDeterministic

Computes an approximate quantile of a numeric data sequence using a deterministic reservoir sampling algorithm. The result is consistent across executions.

Syntax

quantileDeterministic(level)(expr, determinator)

Alias: medianDeterministic

Arguments

  • level: An optional Float64 parameter (0 to 1) specifying the quantile level. Default: 0.5.
  • expr: A numeric expression (Integer, Float, Decimal, Date, or DateTime).
  • determinator: A positive numeric value whose hash is used for deterministic sampling.

Returns

The approximate quantile of the specified level. Float64 for numeric input, Date for Date, DateTime for DateTime.

Example

SELECT quantileDeterministic(number, 1) FROM numbers(5)

Result:

┌─quantileDeterministic(number, 1)─┐
│                              2.5 │
└──────────────────────────────────┘

quantileExact

Exactly computes the quantile of a numeric data sequence by collecting all values into an array and partially sorting it. This function consumes O(n) memory.

Syntax

quantileExact(level)(expr)

Alias: medianExact

Arguments

  • level: An optional Float64 parameter (0 to 1) specifying the quantile level. Default: 0.5.
  • expr: A numeric expression (Integer, Float, Decimal, Date, or DateTime).

Returns

The exact quantile of the specified level. Float64 for numeric input, Date for Date, DateTime for DateTime.

Example

SELECT quantileExact(number) FROM numbers(10)

Result:

┌─quantileExact(number)─┐
│                     5 │
└───────────────────────┘

quantileExactLow

Exactly computes the quantile of a numeric data sequence. For a level of 0.5, it returns the lower median value for an even number of elements and the middle median for an odd number.

Syntax

quantileExactLow(level)(expr)

Alias: medianExactLow

Arguments

  • level: An optional Float64 parameter (0 to 1) specifying the quantile level. Default: 0.5.
  • expr: A numeric expression (Integer, Float, Decimal, Date, or DateTime).

Returns

The exact quantile of the specified level. Float64 for numeric input, Date for Date, DateTime for DateTime.

Example

SELECT quantileExactLow(number) FROM numbers(10)

Result:

┌─quantileExactLow(number)─┐
│                        4 │
└──────────────────────────┘

quantileExactHigh

Exactly computes the quantile of a numeric data sequence. For a level of 0.5, it returns the higher median value for an even number of elements and the middle median for an odd number.

Syntax

quantileExactHigh(level)(expr)

Alias: medianExactHigh

Arguments

  • level: An optional Float64 parameter (0 to 1) specifying the quantile level. Default: 0.5.
  • expr: A numeric expression (Integer, Float, Decimal, Date, or DateTime).

Returns

The exact quantile of the specified level. Float64 for numeric input, Date for Date, DateTime for DateTime.

Example

SELECT quantileExactHigh(number) FROM numbers(10)

Result:

┌─quantileExactHigh(number)─┐
│                         5 │
└───────────────────────────┘

quantileExactExclusive

Exactly computes the quantile of a numeric data sequence, equivalent to the PERCENTILE.EXC Excel function (type R6). The level parameter excludes 0 and 1.

Syntax

quantileExactExclusive(level)(expr)

Arguments

  • level: An optional Float64 parameter (0 to 1, exclusive) specifying the quantile level. Default: 0.5.
  • expr: A numeric expression (Integer, Float, Decimal, Date, or DateTime).

Returns

The exact quantile of the specified level. Float64 for numeric input, Date for Date, DateTime for DateTime.

Example

SELECT quantileExactExclusive(0.5)(number) FROM numbers(10)

Result:

┌─quantileExactExclusive(0.5)(number)─┐
│                                 4.5 │
└─────────────────────────────────────┘

quantileExactInclusive

Exactly computes the quantile of a numeric data sequence, equivalent to the PERCENTILE.INC Excel function (type R7). The level parameter includes 0 and 1.

Syntax

quantileExactInclusive(level)(expr)

Arguments

  • level: An optional Float64 parameter (0 to 1, inclusive) specifying the quantile level. Default: 0.5.
  • expr: A numeric expression (Integer, Float, Decimal, Date, or DateTime).

Returns

The exact quantile of the specified level. Float64 for numeric input, Date for Date, DateTime for DateTime.

Example

SELECT quantileExactInclusive(0.5)(number) FROM numbers(10)

Result:

┌─quantileExactInclusive(0.5)(number)─┐
│                                   5 │
└─────────────────────────────────────┘

quantileExactWeighted

Exactly computes the quantile of a numeric data sequence, taking into account the weight of each element. It uses a hash table for efficiency, especially with repeated values.

Syntax

quantileExactWeighted(level)(expr, weight)

Alias: medianExactWeighted

Arguments

  • level: An optional Float64 parameter (0 to 1) specifying the quantile level. Default: 0.5.
  • expr: A numeric expression (Integer, Float, Decimal, Date, or DateTime).
  • weight: A UInt64 column representing the weight (number of occurrences) of each expr value.

Returns

The exact quantile of the specified level. Float64 for numeric input, Date for Date, DateTime for DateTime.

Example

WITH data AS (SELECT arrayJoin([(0,3), (1,2), (2,1), (5,4)]) AS pair)
SELECT quantileExactWeighted(pair.1, pair.2) FROM data

Result:

┌─quantileExactWeighted(pair.1, pair.2)─┐
│                                     1 │
└───────────────────────────────────────┘

quantileExactWeightedInterpolated

Computes the quantile of a numeric data sequence using linear interpolation and considering the weight of each element. It builds a cumulative distribution based on weights.

Syntax

quantileExactWeightedInterpolated(level)(expr, weight)

Alias: medianExactWeightedInterpolated

Arguments

  • level: An optional Float64 parameter (0 to 1) specifying the quantile level. Default: 0.5.
  • expr: A numeric expression (Integer, Float, Decimal, Date, or DateTime).
  • weight: A UInt64 column representing the weight (number of occurrences) of each expr value.

Returns

The interpolated quantile of the specified level. Float64 for numeric input, Date for Date, DateTime for DateTime.

Example

WITH data AS (SELECT arrayJoin([(0,3), (1,2), (2,1), (5,4)]) AS pair)
SELECT quantileExactWeightedInterpolated(pair.1, pair.2) FROM data

Result:

┌─quantileExactWeightedInterpolated(pair.1, pair.2)─┐
│                                               1.5 │
└───────────────────────────────────────────────────┘

quantileGK

Computes an approximate quantile of a numeric data sequence using the Greenwald-Khanna algorithm, which is efficient for data streams and allows control over accuracy.

Syntax

quantileGK(accuracy, level)(expr)

Alias: medianGK

Arguments

  • accuracy: A positive UInt64 integer controlling the accuracy. Higher values mean less error but more resources.
  • level: An optional Float64 parameter (0 to 1) specifying the quantile level. Default: 0.5.
  • expr: A numeric expression (Integer, Float, Decimal, Date, or DateTime).

Returns

The approximate quantile of the specified level and accuracy. Float64 for numeric input, Date for Date, DateTime for DateTime.

Example

SELECT quantileGK(100, 0.25)(number + 1) FROM numbers(1000)

Result:

┌─quantileGK(100, 0.25)(plus(number, 1))─┐
│                                    251 │
└────────────────────────────────────────┘

quantileInterpolatedWeighted

Computes the quantile of a numeric data sequence using linear interpolation and considering the weight of each element. It sorts values by weight and performs interpolation.

Syntax

quantileInterpolatedWeighted(level)(expr, weight)

Alias: medianInterpolatedWeighted

Arguments

  • level: An optional Float64 parameter (0 to 1) specifying the quantile level. Default: 0.5.
  • expr: A numeric expression (Integer, Float, Decimal, Date, or DateTime).
  • weight: A numeric column representing the weight of each expr value.

Returns

The interpolated quantile of the specified level. Float64 for numeric input, Date for Date, DateTime for DateTime.

Example

WITH data AS (SELECT arrayJoin([(0,3), (1,2), (2,1), (5,4)]) AS pair)
SELECT quantileInterpolatedWeighted(pair.1, pair.2) FROM data

Result:

┌─quantileInterpolatedWeighted(pair.1, pair.2)─┐
│                                            1 │
└──────────────────────────────────────────────┘

quantiles

This is a family of functions that calculate multiple quantiles simultaneously for a given data sequence. They return an array of results for the specified levels.

Syntax

quantiles(level1, level2, ...)(x)

This general syntax applies to specific implementations like quantilesDeterministic, quantilesTiming, quantilesTimingWeighted, quantilesExact, quantilesExactWeighted, quantileExactWeightedInterpolated, quantileInterpolatedWeighted, quantilesTDigest, quantilesBFloat16, quantilesDD.

Arguments

  • level1, level2, ...: One or more Float64 parameters (0 to 1) specifying the quantile levels.
  • x: A numeric expression (Integer, Float, Decimal, Date, or DateTime).

Returns

An array of quantiles for the specified levels. Array(T).

Example

SELECT quantiles(0.25, 0.5, 0.75)(number) FROM numbers(10)

Result:

┌─quantiles(0.25, 0.5, 0.75)(number)─┐
│ [2.5,5,7.5]                        │
└────────────────────────────────────┘

quantilesExactExclusive

Exactly computes multiple quantiles of a numeric data sequence, equivalent to the PERCENTILE.EXC Excel function (type R6). It is more efficient for sets of levels than calling quantileExactExclusive multiple times.

Syntax

quantilesExactExclusive(level1, level2, ...)(expr)

Arguments

  • level1, level2, ...: One or more Float64 parameters (0 to 1, exclusive) specifying the quantile levels.
  • expr: A numeric expression (Integer, Float, Decimal, Date, or DateTime).

Returns

An array of exact quantiles for the specified levels. Array(T).

Example

SELECT quantilesExactExclusive(0.25, 0.5, 0.75)(number) FROM numbers(10)

Result:

┌─quantilesExactExclusive(0.25, 0.5, 0.75)(number)─┐
│ [2.25,4.5,6.75]                                  │
└──────────────────────────────────────────────────┘

quantilesExactInclusive

Exactly computes multiple quantiles of a numeric data sequence, equivalent to the PERCENTILE.INC Excel function (type R7). It is more efficient for sets of levels than calling quantileExactInclusive multiple times.

Syntax

quantilesExactInclusive(level1, level2, ...)(expr)

Arguments

  • level1, level2, ...: One or more Float64 parameters (0 to 1, inclusive) specifying the quantile levels.
  • expr: A numeric expression (Integer, Float, Decimal, Date, or DateTime).

Returns

An array of exact quantiles for the specified levels. Array(T).

Example

SELECT quantilesExactInclusive(0.25, 0.5, 0.75)(number) FROM numbers(10)

Result:

┌─quantilesExactInclusive(0.25, 0.5, 0.75)(number)─┐
│ [2.25,4.5,6.75]                                  │
└──────────────────────────────────────────────────┘

quantilesGK

Computes multiple approximate quantiles of a numeric data sequence using the Greenwald-Khanna algorithm, allowing control over accuracy and returning an array of results.

Syntax

quantilesGK(accuracy, level1, level2, ...)(expr)

Arguments

  • accuracy: A positive UInt64 integer controlling the accuracy.
  • level1, level2, ...: One or more Float64 parameters (0 to 1) specifying the quantile levels.
  • expr: A numeric expression (Integer, Float, Decimal, Date, or DateTime).

Returns

An array of approximate quantiles for the specified levels. Array(T).

Example

SELECT quantilesGK(100, 0.25, 0.5, 0.75)(number + 1) FROM numbers(1000)

Result:

┌─quantilesGK(100, 0.25, 0.5, 0.75)(plus(number, 1))─┐
│ [251,498,741]                                      │
└────────────────────────────────────────────────────┘

quantileTDigest

Computes an approximate quantile of a numeric data sequence using the t-digest algorithm. This function is memory-efficient (log(n)) but non-deterministic.

Syntax

quantileTDigest(level)(expr)

Alias: medianTDigest

Arguments

  • level: An optional Float64 parameter (0 to 1) specifying the quantile level. Default: 0.5.
  • expr: A numeric expression (Integer, Float, Decimal, Date, or DateTime).

Returns

The approximate quantile of the specified level. Float64 for numeric input, Date for Date, DateTime for DateTime.

Example

SELECT quantileTDigest(number) FROM numbers(10)

Result:

┌─quantileTDigest(number)─┐
│                     4.5 │
└─────────────────────────┘

quantileTDigestWeighted

Computes an approximate quantile of a numeric data sequence using the t-digest algorithm, considering the weight of each element. It is memory-efficient (log(n)) but non-deterministic.

Syntax

quantileTDigestWeighted(level)(expr, weight)

Alias: medianTDigestWeighted

Arguments

  • level: An optional Float64 parameter (0 to 1) specifying the quantile level. Default: 0.5.
  • expr: A numeric expression (Integer, Float, Decimal, Date, or DateTime).
  • weight: A numeric column representing the weight of each expr value.

Returns

The approximate quantile of the specified level. Float64 for numeric input, Date for Date, DateTime for DateTime.

Example

SELECT quantileTDigestWeighted(number, 1) FROM numbers(10)

Result:

┌─quantileTDigestWeighted(number, 1)─┐
│                                4.5 │
└────────────────────────────────────┘

quantileTiming

Computes the quantile of a numeric data sequence with determined precision, optimized for time-related distributions (e.g., page loading times). The result is deterministic.

Syntax

quantileTiming(level)(expr)

Alias: medianTiming

Arguments

  • level: An optional Float64 parameter (0 to 1) specifying the quantile level. Default: 0.5.
  • expr: A Float* expression representing time values. Negative values lead to undefined behavior. Values greater than 30,000 are treated as 30,000.

Returns

The quantile of the specified level as a Float32. Returns NaN if no values are passed.

Example

WITH data AS (SELECT arrayJoin([72, 112, 126, 145, 104, 242, 313, 168, 108]) AS response_time)
SELECT quantileTiming(response_time) FROM data

Result:

┌─quantileTiming(response_time)─┐
│                           126 │
└───────────────────────────────┘

quantileTimingWeighted

Computes the quantile of a numeric data sequence with determined precision, considering the weight of each element, optimized for time-related distributions. The result is deterministic.

Syntax

quantileTimingWeighted(level)(expr, weight)

Alias: medianTimingWeighted

Arguments

  • level: An optional Float64 parameter (0 to 1) specifying the quantile level. Default: 0.5.
  • expr: A Float* expression representing time values. Negative values lead to undefined behavior. Values greater than 30,000 are treated as 30,000.
  • weight: A numeric column representing the weight of each expr value.

Returns

The quantile of the specified level as a Float32. Returns NaN if no values are passed.

Example

WITH data AS (SELECT arrayJoin([(68,1), (104,2), (112,3), (126,2), (138,1), (162,1)]) AS pair)
SELECT quantileTimingWeighted(pair.1, pair.2) FROM data

Result:

┌─quantileTimingWeighted(pair.1, pair.2)─┐
│                                    112 │
└────────────────────────────────────────┘

quantilesTimingWeighted

Computes multiple quantiles of a numeric data sequence with determined precision, considering the weight of each element, optimized for time-related distributions. It returns an array of results for the specified levels.

Syntax

quantilesTimingWeighted(level1, level2, ...)(expr, weight)

Arguments

  • level1, level2, ...: One or more Float64 parameters (0 to 1) specifying the quantile levels.
  • expr: A Float* expression representing time values.
  • weight: A numeric column representing the weight of each expr value.

Returns

An array of quantiles for the specified levels. Array(Float32).

Example

WITH data AS (SELECT arrayJoin([(68,1), (104,2), (112,3), (126,2), (138,1), (162,1)]) AS pair)
SELECT quantilesTimingWeighted(0.5, 0.99)(pair.1, pair.2) FROM data

Result:

┌─quantilesTimingWeighted(0.5, 0.99)(pair.1, pair.2)─┐
│ [112,162]                                          │
└────────────────────────────────────────────────────┘

rankCorr

Computes a rank correlation coefficient between two numeric variables, measuring the monotonic relationship between them.

Syntax

rankCorr(x, y)

Arguments

  • x: The first numeric variable (Float32 or Float64).
  • y: The second numeric variable (Float32 or Float64).

Returns

A Float64 rank correlation coefficient between -1 and +1.

Example

SELECT rankCorr(number, number) FROM numbers(10)

Result:

┌─rankCorr(number, number)─┐
│                        1 │
└──────────────────────────┘

simpleLinearRegression

Performs a simple linear regression to find the best-fit line y = k*x + b for a set of (x, y) data points.

Syntax

simpleLinearRegression(x, y)

Arguments

  • x: The column of independent variable values.
  • y: The column of dependent variable values.

Returns

A tuple containing the slope k and the y-intercept b of the regression line. Tuple(Float64, Float64).

Example

SELECT simpleLinearRegression(x, y)
FROM values('x Int64, y Int64', (0, 0), (1, 1), (2, 2), (3, 3))

Result:

┌─simpleLinearRegression(x, y)─┐
│ (1,0)                        │
└──────────────────────────────┘

singleValueOrNull

Checks if there is exactly one unique non-NULL value in a group. If so, it returns that value; otherwise, it returns NULL.

Syntax

singleValueOrNull(x)

Arguments

  • x: A column of any data type (excluding Map, Array, or Tuple).

Returns

The unique non-NULL value if exactly one exists, otherwise NULL. The return type matches the x type, but is Nullable.

Example

SELECT singleValueOrNull(col)
FROM values('col String', 'A', 'A', 'A')

Result:

┌─singleValueOrNull(col)─┐
│ A                      │
└────────────────────────┘

skewPop

Computes the population skewness of a sequence of numbers, which measures the asymmetry of the probability distribution.

Syntax

skewPop(expr)

Arguments

  • expr: A numeric expression.

Returns

The population skewness as a Float64.

Example

SELECT skewPop(number) FROM numbers(10)

Result:

┌─skewPop(number)─┐
│               0 │
└─────────────────┘

skewSamp

Computes the sample skewness of a sequence of numbers, providing an unbiased estimate of the skewness of a random variable from a sample.

Syntax

skewSamp(expr)

Arguments

  • expr: A numeric expression.

Returns

The sample skewness as a Float64. Returns NaN if the sample size is 1 or less.

Example

SELECT skewSamp(number) FROM numbers(10)

Result:

┌─skewSamp(number)─┐
│                0 │
└──────────────────┘

sparkbar

Generates a frequency histogram as a string visualization for values x and their frequencies y over a specified interval.

Syntax

sparkbar(buckets[, min_x, max_x])(x, y)

Parameters

  • buckets: The UInt64 number of segments (bars) in the histogram.
  • min_x: Optional. The minimum value for the x-axis interval.
  • max_x: Optional. The maximum value for the x-axis interval.

Arguments

  • x: The values to categorize.
  • y: The frequency or count for each x value.

Returns

A String representing the sparkbar histogram.

Example

SELECT sparkbar(5)(number, 1) FROM numbers(10)

Result:

┌─sparkbar(5)(number, 1)─┐
│ █▊▊▊▊                  │
└────────────────────────┘

stddevPop

Calculates the population standard deviation, which is the square root of the population variance. This function uses a numerically unstable algorithm.

Syntax

stddevPop(x)

Aliases: STD, STDDEV_POP

Arguments

  • x: A numeric expression (Integer, Float, or Decimal).

Returns

The population standard deviation as a Float64.

Example

SELECT stddevPop(number) FROM numbers(10)

Result:

┌─stddevPop(number)─┐
│ 2.8722813232690143 │
└───────────────────┘

stddevPopStable

Calculates the population standard deviation using a numerically stable algorithm. This function is generally slower but provides more accurate results than stddevPop.

Syntax

stddevPopStable(x)

Alias: VAR_POP_STABLE

Arguments

  • x: A numeric expression (Integer, Float, or Decimal).

Returns

The population standard deviation as a Float64.

Example

SELECT stddevPopStable(number) FROM numbers(10)

Result:

┌─stddevPopStable(number)─┐
│      2.8722813232690143 │
└─────────────────────────┘

stddevSamp

Calculates the sample standard deviation, which is the square root of the sample variance. This function uses a numerically unstable algorithm.

Syntax

stddevSamp(x)

Alias: STDDEV_SAMP

Arguments

  • x: A numeric expression (Integer, Float, or Decimal).

Returns

The sample standard deviation as a Float64.

Example

SELECT stddevSamp(number) FROM numbers(10)

Result:

┌─stddevSamp(number)─┐
│ 3.0276503540974917 │
└────────────────────┘

stddevSampStable

Calculates the sample standard deviation using a numerically stable algorithm. This function is generally slower but provides more accurate results than stddevSamp.

Syntax

stddevSampStable(x)

Alias: VAR_SAMP_STABLE

Arguments

  • x: A numeric expression (Integer, Float, or Decimal).

Returns

The sample standard deviation as a Float64.

Example

SELECT stddevSampStable(number) FROM numbers(10)

Result:

┌─stddevSampStable(number)─┐
│       3.0276503540974917 │
└──────────────────────────┘

studentTTest

Applies Student's t-test to compare two samples and determine if their population means are equal, assuming normal distribution and equal variances.

Syntax

studentTTest([confidence_level])(sample_data, sample_index)

Parameters

  • confidence_level: An optional Float64 confidence level for calculating confidence intervals.

Arguments

  • sample_data: The numeric values from both samples (Integer, Float, or Decimal).
  • sample_index: An Integer (0 or 1) indicating which sample the sample_data belongs to.

Returns

A tuple containing the calculated t-statistic and p-value. If confidence_level is specified, it also includes the lower and upper bounds of the confidence interval. Tuple(Float64, Float64) or Tuple(Float64, Float64, Float64, Float64).

Example

WITH student_ttest AS (
  SELECT arrayJoin([(20.3,0), (21.1,0), (21.9,1), (21.7,0), (19.9,1), (21.8,1)]) AS data_pair
)
SELECT studentTTest(data_pair.1, data_pair.2) FROM student_ttest

Result:

┌─studentTTest(data_pair.1, data_pair.2)────┐
│ (-0.21739130434783777,0.8385421208415731) │
└───────────────────────────────────────────┘

sum

Calculates the sum of numeric values within a group.

Syntax

sum(num)

Arguments

  • num: A numeric expression (Integer, Float, or Decimal).

Returns

The sum of the values. The return type matches the num type.

Example

SELECT sum(number) FROM numbers(10)

Result:

┌─sum(number)─┐
│          45 │
└─────────────┘

sumCount

Calculates both the sum of numeric values and the count of non-NULL rows within a group.

Syntax

sumCount(x)

Arguments

  • x: A numeric expression (Integer, Float, or Decimal).

Returns

A tuple containing the sum of numbers and the count of non-NULL values. Tuple(T_sum, UInt64).

Example

SELECT sumCount(number) FROM numbers(10)

Result:

┌─sumCount(number)─┐
│ (45,10)          │
└──────────────────┘

sumKahan

Calculates the sum of numeric values using the Kahan compensated summation algorithm, which provides higher precision for floating-point numbers than a simple sum.

Syntax

sumKahan(x)

Arguments

  • x: A numeric expression (Integer, Float, or Decimal).

Returns

The sum of the values, with a type matching the input x.

Example

SELECT sum(0.1), sumKahan(0.1) FROM numbers(10)

Result:

┌───────────sum(0.1)─┬─sumKahan(0.1)─┐
│ 0.9999999999999999 │             1 │
└────────────────────┴───────────────┘

sumMap

Totals values for each key across multiple maps within a group. It returns a tuple of two arrays: sorted keys and their corresponding summed values, without overflow.

Syntax

sumMap(key, value)
sumMap(Tuple(key, value))

Alias: sumMappedArrays

Arguments

  • key: An array of keys.
  • value: An array of values. The number of elements in key and value arrays must be the same for each row.

Returns

A tuple containing two arrays: sorted keys and their summed values. Tuple(Array(T_key), Array(T_value)).

Example

SELECT sumMap(a, b)
FROM values('a Array(String), b Array(Int64)', (['x', 'y'], [2, 2]), (['y', 'z'], [3, 1]))

Result:

┌─sumMap(a, b)───────────┐
│ [['x','y','z'],[2,5,1]]│
└────────────────────────┘

sumMapWithOverflow

Totals values for each key across multiple maps within a group. It returns a tuple of two arrays: sorted keys and their corresponding summed values, allowing for overflow if the sum exceeds the data type's maximum.

Syntax

sumMapWithOverflow(key, value)
sumMapWithOverflow(Tuple(key, value))

Arguments

  • key: An array of keys.
  • value: An array of values. The number of elements in key and value arrays must be the same for each row.

Returns

A tuple containing two arrays: sorted keys and their summed values. Tuple(Array(T_key), Array(T_value)).

Example

SELECT sumMapWithOverflow(a, b)
FROM values('a Array(String), b Array(Int8)', (['x', 'y'], [100, 50]), (['y', 'z'], [80, 20]))

Result:

┌─sumMapWithOverflow(a, b)───────────┐
│ [['x','y','z'],[100,-126,20]]      │
└────────────────────────────────────┘

sumWithOverflow

Calculates the sum of numeric values within a group, using the same data type for the result as the input. If the sum exceeds the maximum value for that data type, it will overflow.

Syntax

sumWithOverflow(num)

Arguments

  • num: A numeric expression (Integer, Float, or Decimal).

Returns

The sum of the values, with a type matching the input num.

Example

SELECT sumWithOverflow(toUInt8(number)) FROM numbers(200)

Result:

┌─sumWithOverflow(toUInt8(number))─┐
│                              104 │
└──────────────────────────────────┘

theilsU

Calculates Theil's U uncertainty coefficient, a measure of association between two categorical variables. The value ranges from -1.0 (perfect negative association) to +1.0 (perfect positive association), with 0.0 indicating no association.

Syntax

theilsU(column1, column2)

Arguments

  • column1: The first categorical column.
  • column2: The second categorical column.

Returns

A Float64 value between -1 and 1.

Example

SELECT
    theilsU(a ,b)
FROM
    (
        SELECT
            number % 10 AS a,
            number % 4 AS b
        FROM
            numbers(150)
    )

Result:

┌────────theilsU(a, b)─┐
│ -0.30195720557678846 │
└──────────────────────┘

topK

Returns an array of the approximately most frequent values in a specified column, sorted by approximate frequency in descending order. It uses the Filtered Space-Saving algorithm.

Syntax

topK(N)(column)
topK(N, load_factor)(column)
topK(N, load_factor, 'counts')(column)

Parameters

  • N: The maximum UInt64 number of elements to return. Optional. Default: 10.
  • load_factor: A UInt64 factor determining the number of cells reserved for values. Optional. Default: 3.
  • 'counts': A String literal to include approximate counts and error values in the result.

Arguments

  • column: The column whose value frequency is to be calculated.

Returns

An array of the most frequent values. If 'counts' is specified, it returns an array of tuples (value, approximate_count, error). Array(T) or Array(Tuple(T, UInt64, UInt64)).

Example

SELECT topK(3)(number % 5) FROM numbers(100)

Result:

┌─topK(3)(modulo(number, 5))─┐
│ [0,1,2]                    │
└────────────────────────────┘

topKWeighted

Returns an array of the approximately most frequent values in a specified column, considering a weight for each value. The result is sorted by approximate weighted frequency in descending order.

Syntax

topKWeighted(N)(column, weight)
topKWeighted(N, load_factor)(column, weight)
topKWeighted(N, load_factor, 'counts')(column, weight)

Parameters

  • N: The maximum UInt64 number of elements to return. Optional. Default: 10.
  • load_factor: A UInt64 factor determining the number of cells reserved for values. Optional. Default: 3.
  • 'counts': A String literal to include approximate counts and error values in the result.

Arguments

  • column: The column whose value frequency is to be calculated.
  • weight: A UInt64 value representing the weight for each column entry.

Returns

An array of the values with the maximum approximate sum of weights. If 'counts' is specified, it returns an array of tuples (value, approximate_weighted_count, error). Array(T) or Array(Tuple(T, UInt64, UInt64)).

Example

SELECT topKWeighted(2)(k, w) FROM
VALUES('k String, w UInt64', ('y', 1), ('y', 1), ('x', 5), ('y', 1), ('z', 10))

Result:

┌─topKWeighted(2)(k, w)──┐
│ ['z','x']              │
└────────────────────────┘

uniq

Calculates the approximate number of distinct values for an argument using an adaptive sampling algorithm. This function is generally recommended for most scenarios due to its balance of accuracy and performance.

Syntax

uniq(x[, ...])

Arguments

  • x: One or more expressions or columns (Tuple, Array, Date, DateTime, String, or numeric types).

Returns

The approximate number of distinct values as a UInt64.

Example

SELECT uniq(number % 10) FROM numbers(100)

Result:

┌─uniq(modulo(number, 10))─┐
│                       10 │
└──────────────────────────┘

uniqCombined

Calculates the approximate number of distinct argument values using a combination of array, hash table, and HyperLogLog algorithms. It offers better memory efficiency and accuracy than uniq but can be slightly slower.

Syntax

uniqCombined(HLL_precision)(x[, ...])

Parameters

  • HLL_precision: An optional UInt64 (base-2 logarithm of HyperLogLog cells). Default: 17.

Arguments

  • x: One or more expressions or columns (Tuple, Array, Date, DateTime, String, or numeric types).

Returns

The approximate number of distinct values as a UInt64.

Example

SELECT uniqCombined(number) FROM numbers(1e6)

Result:

┌─uniqCombined(number)─┐
│              1001148 │
└──────────────────────┘

uniqCombined64

Calculates the approximate number of distinct argument values, similar to uniqCombined, but uses a 64-bit hash for all data types. This provides higher accuracy for very large cardinalities, especially for non-string types.

Syntax

uniqCombined64(HLL_precision)(x[, ...])

Parameters

  • HLL_precision: An optional UInt64 (base-2 logarithm of HyperLogLog cells). Default: 17.

Arguments

  • x: One or more expressions or columns (Tuple, Array, Date, DateTime, String, or numeric types).

Returns

The approximate number of distinct values as a UInt64.

Example

SELECT uniqCombined64(number) FROM numbers(1e6)

Result:

┌─uniqCombined64(number)─┐
│                1000000 │
└────────────────────────┘

uniqExact

Calculates the exact number of distinct argument values. This function consumes O(n) memory, where n is the number of distinct values, and should be used when absolute precision is required.

Syntax

uniqExact(x[, ...])

Arguments

  • x: One or more expressions or columns (Tuple, Array, Date, DateTime, String, or numeric types).

Returns

The exact number of distinct values as a UInt64.

Example

SELECT uniqExact(number % 10) FROM numbers(100)

Result:

┌─uniqExact(modulo(number, 10))─┐
│                            10 │
└───────────────────────────────┘

uniqHLL12

Calculates the approximate number of distinct argument values using the HyperLogLog algorithm with 2^12 5-bit cells. This function is generally not recommended for new development due to its lower accuracy compared to uniq or uniqCombined.

Syntax

uniqHLL12(x[, ...])

Arguments

  • x: One or more expressions or columns (Tuple, Array, Date, DateTime, String, or numeric types).

Returns

The approximate number of distinct values as a UInt64.

Example

SELECT uniqHLL12(number) FROM numbers(1000)

Result:

┌─uniqHLL12(number)─┐
│               999 │
└───────────────────┘

uniqTheta

Calculates the approximate number of distinct argument values using the Theta Sketch Framework (KMV algorithm). It provides a relative error of 3.125% (95% confidence).

Syntax

uniqTheta(x[, ...])

Arguments

  • x: One or more expressions or columns (Tuple, Array, Date, DateTime, String, or numeric types).

Returns

The approximate number of distinct values as a UInt64.

Example

SELECT uniqTheta(number) FROM numbers(1000)

Result:

┌─uniqTheta(number)─┐
│              1000 │
└───────────────────┘

varPop

Calculates the population variance of a set of numeric values. This function uses a numerically unstable algorithm.

Syntax

varPop(x)

Alias: VAR_POP

Arguments

  • x: A numeric expression (Integer, Float, or Decimal).

Returns

The population variance as a Float64.

Example

SELECT varPop(number) FROM numbers(10)

Result:

┌─varPop(number)─┐
│           8.25 │
└────────────────┘

varPopStable

Calculates the population variance of a set of numeric values using a numerically stable algorithm. This function is generally slower but provides more accurate results than varPop.

Syntax

varPopStable(x)

Alias: VAR_POP_STABLE

Arguments

  • x: A numeric expression (Integer, Float, or Decimal).

Returns

The population variance as a Float64.

Example

SELECT varPopStable(number) FROM numbers(10)

Result:

┌─varPopStable(number)─┐
│                 8.25 │
└──────────────────────┘

varSamp

Calculates the sample variance of a set of numeric values. This function uses a numerically unstable algorithm.

Syntax

varSamp(x)

Alias: VAR_SAMP

Arguments

  • x: A numeric expression (Integer, Float, or Decimal).

Returns

The sample variance as a Float64.

Example

SELECT varSamp(number) FROM numbers(10)

Result:

┌─varSamp(number)─┐
│            9.166666666666666 │
└─────────────────┘

varSampStable

Calculates the sample variance of a set of numeric values using a numerically stable algorithm. This function is generally slower but provides more accurate results than varSamp.

Syntax

varSampStable(x)

Alias: VAR_SAMP_STABLE

Arguments

  • x: A numeric expression (Integer, Float, or Decimal).

Returns

The sample variance as a Float64.

Example

SELECT varSampStable(number) FROM numbers(10)

Result:

┌─varSampStable(number)─┐
│    9.166666666666666 │
└───────────────────────┘

welchTTest

Applies Welch's t-test to compare two samples and determine if their population means are equal, without assuming equal variances.

Syntax

welchTTest([confidence_level])(sample_data, sample_index)

Parameters

  • confidence_level: An optional Float64 confidence level for calculating confidence intervals.

Arguments

  • sample_data: The numeric values from both samples (Integer, Float, or Decimal).
  • sample_index: An Integer (0 or 1) indicating which sample the sample_data belongs to.

Returns

A tuple containing the calculated t-statistic and p-value. If confidence_level is specified, it also includes the lower and upper bounds of the confidence interval. Tuple(Float64, Float64) or Tuple(Float64, Float64, Float64, Float64).

Example

WITH welch_ttest AS (
  SELECT arrayJoin([(20.3,0), (22.1,0), (21.9,0), (18.9,1), (20.3,1), (19.0,1)]) AS data_pair
)
SELECT welchTTest(data_pair.1, data_pair.2) FROM welch_ttest

Result:

┌─welchTTest(data_pair.1, data_pair.2)──────┐
│ (2.7988719532211235,0.051807360348581945) │
└───────────────────────────────────────────┘
Updated