Aggregate functions¶
Aggregate functions perform a calculation on a set of values and return a single value. They are commonly used in data analysis and reporting to summarize large datasets. These functions can be used to compute sums, averages, counts, and other statistical measures.
During aggregation, NULL arguments are generally skipped. If an aggregate function has multiple arguments, any row where one or more of these arguments are NULL will be ignored. Some functions, like first_value or last_value with the RESPECT NULLS modifier, are exceptions to this rule.
Aggregate function combinators¶
The name of an aggregate function can have a suffix appended to it. This changes the way the aggregate function works.
-If¶
The suffix -If can be appended to the name of any aggregate function. In this case, the aggregate function accepts an extra argument – a condition (Uint8 type). The aggregate function processes only the rows that trigger the condition. If the condition was not triggered even once, it returns a default value (usually zeros or empty strings).
Examples: sumIf(column, cond), countIf(cond), avgIf(x, cond), quantilesTimingIf(level1, level2)(x, cond), argMinIf(arg, val, cond) and so on.
With conditional aggregate functions, you can calculate aggregates for several conditions at once, without using subqueries and JOINs. For example, conditional aggregate functions can be used to implement the segment comparison functionality.
-Array¶
The -Array suffix can be appended to any aggregate function. In this case, the aggregate function takes arguments of the 'Array(T)' type (arrays) instead of 'T' type arguments. If the aggregate function accepts multiple arguments, this must be arrays of equal lengths. When processing arrays, the aggregate function works like the original aggregate function across all array elements.
Example 1: sumArray(arr) - Totals all the elements of all 'arr' arrays. In this example, it could have been written more simply: sum(arraySum(arr)).
Example 2: uniqArray(arr) – Counts the number of unique elements in all 'arr' arrays. This could be done an easier way: uniq(arrayJoin(arr)), but it's not always possible to add 'arrayJoin' to a query.
-If and -Array can be combined. However, 'Array' must come first, then 'If'. Examples: uniqArrayIf(arr, cond), quantilesTimingArrayIf(level1, level2)(arr, cond). Due to this order, the 'cond' argument won't be an array.
-Map¶
The -Map suffix can be appended to any aggregate function. This will create an aggregate function which gets Map type as an argument, and aggregates values of each key of the map separately using the specified aggregate function. The result is also of a Map type.
Example¶
WITH map_map AS (
SELECT c1::Date AS date, c2::DateTime AS timeslot, c3::Map(String, UInt16) AS status FROM values (
('2000-01-01', '2000-01-01 00:00:00', (['a', 'b', 'c'], [10, 10, 10])),
('2000-01-01', '2000-01-01 00:00:00', (['c', 'd', 'e'], [10, 10, 10])),
('2000-01-01', '2000-01-01 00:01:00', (['d', 'e', 'f'], [10, 10, 10])),
('2000-01-01', '2000-01-01 00:01:00', (['f', 'g', 'g'], [10, 10, 10]))
))
SELECT
timeslot,
sumMap(status),
avgMap(status),
minMap(status)
FROM map_map
GROUP BY timeslot
Result:
┌────────────timeslot─┬─sumMap(status)───────────────────────┬─avgMap(status)───────────────────────┬─minMap(status)───────────────────────┐
│ 2000-01-01 00:00:00 │ {'a':10,'b':10,'c':20,'d':10,'e':10} │ {'a':10,'b':10,'c':10,'d':10,'e':10} │ {'a':10,'b':10,'c':10,'d':10,'e':10} │
│ 2000-01-01 00:01:00 │ {'d':10,'e':10,'f':20,'g':20} │ {'d':10,'e':10,'f':10,'g':10} │ {'d':10,'e':10,'f':10,'g':10} │
└─────────────────────┴──────────────────────────────────────┴──────────────────────────────────────┴──────────────────────────────────────┘
-SimpleState¶
If you apply this combinator, the aggregate function returns the same value but with a different type. This is a SimpleAggregateFunction(...) that can be stored in a table to work with AggregatingMergeTree tables.
Syntax¶
<aggFunction>SimpleState(x)
Arguments¶
x: Aggregate function parameters.
Returned values¶
The value of an aggregate function with the SimpleAggregateFunction(...) type.
Example¶
Query:
WITH anySimpleState(number) AS c SELECT toTypeName(c), c FROM numbers(1)
Result:
┌─toTypeName(c)────────────────────────┬─c─┐ │ SimpleAggregateFunction(any, UInt64) │ 0 │ └──────────────────────────────────────┴───┘
-State¶
If you apply this combinator, the aggregate function doesn't return the resulting value (such as the number of unique values for the uniq function), but an intermediate state of the aggregation (for uniq, this is the hash table for calculating the number of unique values). This is an AggregateFunction(...) that can be used for further processing or stored in a table to finish aggregating later.
-MapState isn't an invariant for the same data because the order of data in intermediate state can change, though it doesn't impact ingestion of this data.
To work with these states, use:
- AggregatingMergeTree table engine.
- finalizeAggregation function.
- runningAccumulate function.
- -Merge combinator.
- -MergeState combinator.
-Merge¶
If you apply this combinator, the aggregate function takes the intermediate aggregation state as an argument, combines the states to finish aggregation, and returns the resulting value.
-MergeState¶
Merges the intermediate aggregation states in the same way as the -Merge combinator. However, it doesn't return the resulting value, but an intermediate aggregation state, similar to the -State combinator.
-ForEach¶
Converts an aggregate function for tables into an aggregate function for arrays that aggregates the corresponding array items and returns an array of results. For example, sumForEach for the arrays [1, 2], [3, 4, 5]and[6, 7]returns the result [10, 13, 5] after adding together the corresponding array items.
-Distinct¶
Every unique combination of arguments will be aggregated only once. Repeating values are ignored. Examples: sum(DISTINCT x) (or sumDistinct(x)), groupArray(DISTINCT x) (or groupArrayDistinct(x)), corrStable(DISTINCT x, y) (or corrStableDistinct(x, y)) and so on.
-OrDefault¶
Changes behavior of an aggregate function.
If an aggregate function doesn't have input values, with this combinator it returns the default value for its return data type. Applies to the aggregate functions that can take empty input data.
-OrDefault can be used with other combinators.
Syntax¶
<aggFunction>OrDefault(x)
Arguments¶
x: Aggregate function parameters.
Returned values¶
Returns the default value of an aggregate function's return type if there is nothing to aggregate.
Type depends on the aggregate function used.
Example¶
Query:
SELECT avg(number), avgOrDefault(number) FROM numbers(0)
Result:
┌─avg(number)─-─avgOrDefault(number)─┐ │ nan │ 0 │ └─────────────┴──────────────────────┘
Also -OrDefault can be used with another combinators. It is useful when the aggregate function doesn't accept the empty input.
Query:
SELECT avgOrDefaultIf(x, x > 10)
FROM
(
SELECT toDecimal32(1.23, 2) AS x
)
Result:
┌─avgOrDefaultIf(x, greater(x, 10))─┐ │ 0.00 │ └───────────────────────────────────┘
-OrNull¶
Changes behavior of an aggregate function.
This combinator converts a result of an aggregate function to the Nullable data type. If the aggregate function doesn't have values to calculate it returns NULL.
-OrNull can be used with other combinators.
Syntax¶
<aggFunction>OrNull(x)
Arguments¶
x: Aggregate function parameters.
Returned values¶
- The result of the aggregate function, converted to the
Nullabledata type. NULL, if there is nothing to aggregate.
Type: Nullable(aggregate function return type).
Example¶
Add -orNull to the end of aggregate function.
Query:
SELECT sumOrNull(number), toTypeName(sumOrNull(number)) FROM numbers(10) WHERE number > 10
Result:
┌─sumOrNull(number)─┬─toTypeName(sumOrNull(number))─┐ │ ᴺᵁᴸᴸ │ Nullable(UInt64) │ └───────────────────┴───────────────────────────────┘
Also -OrNull can be used with another combinators. It is useful when the aggregate function doesn't accept the empty input.
Query:
SELECT avgOrNullIf(x, x > 10)
FROM
(
SELECT toDecimal32(1.23, 2) AS x
)
Result:
┌─avgOrNullIf(x, greater(x, 10))─┐ │ ᴺᵁᴸᴸ │ └────────────────────────────────┘
-Resample¶
Lets you divide data into groups, and then separately aggregates the data in those groups. Groups are created by splitting the values from one column into intervals.
<aggFunction>Resample(start, end, step)(<aggFunction_params>, resampling_key)
Arguments¶
start: Starting value of the whole required interval forresampling_keyvalues.stop: Ending value of the whole required interval forresampling_keyvalues. The whole interval doesn't include thestopvalue[start, stop).step: Step for separating the whole interval into subintervals. TheaggFunctionis executed over each of those subintervals independently.resampling_key: Column whose values are used for separating data into intervals.aggFunction_params:aggFunctionparameters.
Returned values¶
- Array of
aggFunctionresults for each subinterval.
Example¶
Consider the people table with the following data:
┌─name───┬─age─┬─wage─┐ │ John │ 16 │ 10 │ │ Alice │ 30 │ 15 │ │ Mary │ 35 │ 8 │ │ Evelyn │ 48 │ 11.5 │ │ David │ 62 │ 9.9 │ │ Brian │ 60 │ 16 │ └────────┴─────┴──────┘
Let's get the names of the people whose age lies in the intervals of [30,60) and [60,75). Since we use integer representation for age, we get ages in the [30, 59] and [60,74] intervals.
To aggregate names in an array, we use the groupArray aggregate function. It takes one argument. In our case, it's the name column. The groupArrayResample function should use the age column to aggregate names by age. To define the required intervals, we pass the 30, 75, 30 arguments into the groupArrayResample function.
SELECT groupArrayResample(30, 75, 30)(name, age) FROM people
Result:
┌─groupArrayResample(30, 75, 30)(name, age)─────┐ │ [['Alice','Mary','Evelyn'],['David','Brian']] │ └───────────────────────────────────────────────┘
Consider the results.
John is out of the sample because he's too young. Other people are distributed according to the specified age intervals.
Now let's count the total number of people and their average wage in the specified age intervals.
SELECT
countResample(30, 75, 30)(name, age) AS amount,
avgResample(30, 75, 30)(wage, age) AS avg_wage
FROM people
Result:
┌─amount─┬─avg_wage──────────────────┐ │ [3,2] │ [11.5,12.949999809265137] │ └────────┴───────────────────────────┘
-ArgMin¶
The suffix -ArgMin can be appended to the name of any aggregate function. In this case, the aggregate function accepts an additional argument, which should be any comparable expression. The aggregate function processes only the rows that have the minimum value for the specified extra expression.
Examples: sumArgMin(column, expr), countArgMin(expr), avgArgMin(x, expr) and so on.
-ArgMax¶
Similar to suffix -ArgMin but processes only the rows that have the maximum value for the specified extra expression.
analysisOfVariance¶
Performs a one-way analysis of variance (ANOVA) test to determine if the means of two or more groups are significantly different.
Syntax¶
analysisOfVariance(val, group_no)
Alias: anova
Arguments¶
val: Numeric value for the observation.group_no: Integer representing the group thevalbelongs to. Groups are enumerated starting from 0.
Returns¶
A tuple containing the F-statistic and the p-value of the ANOVA test. Tuple(Float64, Float64).
Example¶
SELECT analysisOfVariance(number, number % 2) FROM numbers(100)
Result:
┌─analysisOfVariance(number, modulo(number, 2))─┐ │ (0,1) │ └───────────────────────────────────────────────┘
any¶
Selects the first encountered non-NULL value from a column within a group. The order of selection is non-deterministic unless the data is explicitly ordered.
Syntax¶
any(column) [RESPECT NULLS]
Aliases: any_value, first_value. For any(column) RESPECT NULLS: any_respect_nulls, first_value_respect_nulls, any_value_respect_nulls.
Arguments¶
column: The column from which to select a value.
Returns¶
The first value encountered. If no non-NULL values are found, it returns the default value for the column's type (e.g., 0 for integers, empty string for strings). If RESPECT NULLS is used, it returns the first value encountered, even if it's NULL. The return type matches the input column's type.
Example¶
WITH cte AS (SELECT arrayJoin([NULL, 'Amsterdam', 'New York', 'Tokyo', 'Valencia', NULL]) as city) SELECT any(city), any_respect_nulls(city) FROM cte
Result:
┌─any(city)─┬─any_respect_nulls(city)─┐ │ Amsterdam │ ᴺᵁᴸᴸ │ └───────────┴─────────────────────────┘
anyHeavy¶
Identifies a frequently occurring value in a column using the heavy hitters algorithm. This function is non-deterministic and aims to find a value that appears in more than half of the cases within each execution thread.
Syntax¶
anyHeavy(column)
Arguments¶
column: The column from which to find a frequently occurring value.
Returns¶
A frequently occurring value from the column. The return type matches the input column's type.
Example¶
WITH cte AS (SELECT arrayJoin([2,1,1,1,3,1,1,2,2]) as n) SELECT any(n), anyHeavy(n) FROM cte
Result:
┌─any(n)─┬─anyHeavy(n)─┐ │ 2 │ 1 │ └────────┴─────────────┘
anyLast¶
Selects the last encountered non-NULL value from a column within a group. The order of selection is non-deterministic unless the data is explicitly ordered.
Syntax¶
anyLast(column) [RESPECT NULLS]
Alias: last_value. For anyLast(column) RESPECT NULLS: anyLast_respect_nulls, last_value_respect_nulls.
Arguments¶
column: The column from which to select a value.
Returns¶
The last value encountered. If no non-NULL values are found, it returns the default value for the column's type. If RESPECT NULLS is used, it returns the last value encountered, even if it's NULL. The return type matches the input column's type.
Example¶
WITH cte AS (SELECT arrayJoin([NULL, 'Amsterdam', 'New York', 'Tokyo', 'Valencia', NULL]) as city) SELECT anyLast(city), anyLast_respect_nulls(city) FROM cte
Result:
┌─anyLast(city)─┬─anyLast_respect_nulls(city)─┐ │ Valencia │ ᴺᵁᴸᴸ │ └───────────────┴─────────────────────────────┘
approx_top_k¶
Returns an array of the approximately most frequent values and their counts from a column. The result is sorted by approximate frequency in descending order.
Syntax¶
approx_top_k(N)(column) approx_top_k(N, reserved)(column)
Alias: approx_top_count
Parameters¶
N: The maximum number of elements to return. Optional. Default: 10.reserved: The number of cells reserved for values. If the number of unique values exceedsreserved, the result becomes approximate. Optional. Default:N * 3.
Arguments¶
column: The column whose value frequency is to be calculated.
Returns¶
An array of tuples, where each tuple contains a value, its approximate count, and an error value. Array(Tuple(T, UInt64, UInt64)).
Example¶
SELECT approx_top_k(2)(k)
FROM values('k String, w UInt64', ('y', 1), ('y', 1), ('x', 5), ('y', 1), ('z', 10))
Result:
┌─approx_top_k(2)(k)────┐
│ [('y',3,0),('x',1,0)] │
└───────────────────────┘
approx_top_sum¶
Returns an array of the approximately most frequent values and their weighted counts from a column. The result is sorted by approximate weighted frequency in descending order.
Syntax¶
approx_top_sum(N)(column, weight) approx_top_sum(N, reserved)(column, weight)
Parameters¶
N: The maximum number of elements to return. Optional. Default: 10.reserved: The number of cells reserved for values. If the number of unique values exceedsreserved, the result becomes approximate. Optional. Default:N * 3.
Arguments¶
column: The column whose value frequency is to be calculated.weight: AUInt64value representing the weight for eachcolumnentry.
Returns¶
An array of tuples, where each tuple contains a value, its approximate weighted count, and an error value. Array(Tuple(T, UInt64, UInt64)).
Example¶
SELECT approx_top_sum(2)(k, w)
FROM values('k String, w UInt64', ('y', 1), ('y', 1), ('x', 5), ('y', 1), ('z', 10))
Result:
┌─approx_top_sum(2)(k, w)─┐
│ [('z',10,0),('x',5,0)] │
└─────────────────────────┘
argMax¶
Retrieves the arg value associated with the maximum val value within a group. If multiple val values are tied for the maximum, the returned arg is non-deterministic. NULL values in arg or val are skipped.
Syntax¶
argMax(arg, val)
Arguments¶
arg: The argument whose value is to be returned.val: The value used to determine the maximum.
Returns¶
The arg value corresponding to the maximum val. The return type matches the arg type.
Example¶
WITH salary_data AS (
SELECT user, salary FROM values('user String, salary UInt32', ('director', 5000), ('manager', 3000), ('worker', 1000))
)
SELECT argMax(user, salary) FROM salary_data
Result:
┌─argMax(user, salary)─┐ │ director │ └──────────────────────┘
argMin¶
Retrieves the arg value associated with the minimum val value within a group. If multiple val values are tied for the minimum, the returned arg is non-deterministic. NULL values in arg or val are skipped.
Syntax¶
argMin(arg, val)
Arguments¶
arg: The argument whose value is to be returned.val: The value used to determine the minimum.
Returns¶
The arg value corresponding to the minimum val. The return type matches the arg type.
Example¶
WITH salary_data AS (
SELECT user, salary FROM values('user String, salary UInt32', ('director', 5000), ('manager', 3000), ('worker', 1000))
)
SELECT argMin(user, salary) FROM salary_data
Result:
┌─argMin(user, salary)─┐ │ worker │ └──────────────────────┘
avg¶
Calculates the arithmetic mean (average) of a set of numeric values.
Syntax¶
avg(x)
Arguments¶
x: A numeric expression (Integer, Float, or Decimal).
Returns¶
The arithmetic mean as a Float64. Returns NaN if the input set is empty.
Example¶
SELECT avg(x) FROM values('x Int8', 0, 1, 2, 3, 4, 5)
Result:
┌─avg(x)─┐ │ 2.5 │ └────────┘
avgWeighted¶
Calculates the weighted arithmetic mean of a set of values, where each value contributes according to its specified weight.
Syntax¶
avgWeighted(x, weight)
Arguments¶
x: The numeric values (Integer or Float).weight: The numeric weights corresponding to each value (Integer or Float).
Returns¶
The weighted arithmetic mean as a Float64. Returns NaN if all weights are zero or the input set is empty.
Example¶
SELECT avgWeighted(x, w)
FROM values('x Int8, w Int8', (4, 1), (1, 0), (10, 2))
Result:
┌─avgWeighted(x, weight)─┐ │ 8 │ └────────────────────────┘
boundingRatio¶
Calculates the slope of the line connecting the leftmost and rightmost points in a series of (x, y) coordinates.
Syntax¶
boundingRatio(x, y)
Arguments¶
x: The x-coordinate values.y: The y-coordinate values.
Returns¶
The slope as a Float64.
Example¶
SELECT boundingRatio(number, number * 1.5) FROM numbers(10)
Result:
┌─boundingRatio(number, multiply(number, 1.5))─┐ │ 1.5 │ └──────────────────────────────────────────────┘
categoricalInformationValue¶
Calculates the Information Value (IV) for categorical features against a binary target variable. This metric helps assess the predictive power of categorical variables.
Syntax¶
categoricalInformationValue(category1, category2, ..., tag)
Arguments¶
category1, category2, ...: One or more categorical columns to analyze.tag: The binary target variable (ee.g., 0 or 1).
Returns¶
A Float64 representing the Information Value.
Example¶
SELECT categoricalInformationValue(category, target)
FROM values('category String, target UInt8', ('A', 0), ('A', 1), ('B', 0), ('B', 0), ('C', 1))
Result:
┌─categoricalInformationValue(category, target)─┐ │ 0.0 │ └───────────────────────────────────────────────┘
contingency¶
Calculates the contingency coefficient, a measure of association between two categorical variables. The coefficient ranges from 0 to 1, where higher values indicate a stronger association.
Syntax¶
contingency(column1, column2)
Arguments¶
column1: The first categorical column.column2: The second categorical column.
Returns¶
A Float64 value between 0 and 1.
Example¶
SELECT
contingency(a ,b)
FROM
(
SELECT
number % 10 AS a,
number % 4 AS b
FROM
numbers(150)
)
Result:
┌─contingency(a, b)─┐ │ 0.05812725261759165 │ └───────────────────┘
corr¶
Calculates the Pearson correlation coefficient between two numeric variables. This function uses a numerically unstable algorithm.
Syntax¶
corr(x, y)
Arguments¶
x: The first numeric variable (Integer, Float, or Decimal).y: The second numeric variable (Integer, Float, or Decimal).
Returns¶
The Pearson correlation coefficient as a Float64.
Example¶
SELECT corr(c2, c3) FROM values((1, 5.6, -4.4),(2, -9.6, 3),(3, -1.3, -4),(4, 5.3, 9.7),(5, 4.4, 0.037),(6, -8.6, -7.8),(7, 5.1, 9.3),(8, 7.9, -3.6),(9, -8.2, 0.62),(10, -3, 7.3))
Result:
┌─corr(c2, c3)──────────┐ │ 0.1730265755453256 │ └───────────────────────┘
corrMatrix¶
Computes the correlation matrix for a set of numeric variables.
Syntax¶
corrMatrix(x[, ...])
Arguments¶
x: One or more numeric variables (Integer, Float, or Decimal).
Returns¶
A nested array representing the correlation matrix. Array(Array(Float64)).
Example¶
SELECT arrayMap(x -> round(x, 3), arrayJoin(corrMatrix(c1, c2, c3, c4))) AS corrMatrix FROM values ((1, 5.6, -4.4, 2.6), (2, -9.6, 3, 3.3), (3, -1.3, -4, 1.2), (4, 5.3, 9.7, 2.3), (5, 4.4, 0.037, 1.222), (6, -8.6, -7.8, 2.1233), (7, 5.1, 9.3, 8.1222), (8, 7.9, -3.6, 9.837), (9, -8.2, 0.62, 8.43555), (10, -3, 7.3, 6.762))
Result:
┌─corrMatrix─────────────┐ │ [1,-0.096,0.243,0.746] │ │ [-0.096,1,0.173,0.106] │ │ [0.243,0.173,1,0.258] │ │ [0.746,0.106,0.258,1] │ └────────────────────────┘
corrStable¶
Calculates the Pearson correlation coefficient between two numeric variables using a numerically stable algorithm. This function is generally slower but provides more accurate results than corr.
Syntax¶
corrStable(x, y)
Arguments¶
x: The first numeric variable (Integer, Float, or Decimal).y: The second numeric variable (Integer, Float, or Decimal).
Returns¶
The Pearson correlation coefficient as a Float64.
Example¶
SELECT corrStable(c2, c3) FROM values((1, 5.6, -4.4),(2, -9.6, 3),(3, -1.3, -4),(4, 5.3, 9.7),(5, 4.4, 0.037),(6, -8.6, -7.8),(7, 5.1, 9.3),(8, 7.9, -3.6),(9, -8.2, 0.62),(10, -3, 7.3))
Result:
┌─corrStable(c2, c3)─────┐ │ 0.17302657554532558 │ └────────────────────────┘
count¶
Counts the number of rows or the number of non-NULL values in a specified column.
Syntax¶
count() count(expr)
Arguments¶
- (No arguments): Counts all rows.
expr: An expression or column name. Counts rows whereexpris notNULL.
Returns¶
The count as a UInt64.
Example¶
SELECT count(), count(nullable_col)
FROM values('nullable_col Nullable(String)', 'A', 'B', NULL, 'C', NULL)
Result:
┌─count()─┬─count(nullable_col)─┐ │ 5 │ 3 │ └─────────┴─────────────────────┘
covarPop¶
Calculates the population covariance between two numeric variables. This function uses a numerically unstable algorithm.
Syntax¶
covarPop(x, y)
Arguments¶
x: The first numeric variable (Integer, Float, or Decimal).y: The second numeric variable (Integer, Float, or Decimal).
Returns¶
The population covariance as a Float64.
Example¶
SELECT covarPop(c2, c3) FROM values ((1, 5.6, -4.4),(2, -9.6, 3),(3, -1.3, -4),(4, 5.3, 9.7),(5, 4.4, 0.037),(6, -8.6, -7.8),(7, 5.1, 9.3),(8, 7.9, -3.6),(9, -8.2, 0.62),(10, -3, 7.3))
Result:
┌─covarPop(c2, c3)───┐ │ 6.485648 │ └────────────────────┘
covarPopMatrix¶
Computes the population covariance matrix for a set of numeric variables.
Syntax¶
covarPopMatrix(x[, ...])
Arguments¶
x: One or more numeric variables (Integer, Float, or Decimal).
Returns¶
A nested array representing the population covariance matrix. Array(Array(Float64)).
Example¶
SELECT arrayMap(x -> round(x, 3), arrayJoin(covarPopMatrix(c1, c2, c3, c4))) AS covarPopMatrix FROM values ((1, 5.6, -4.4, 2.6), (2, -9.6, 3, 3.3), (3, -1.3, -4, 1.2), (4, 5.3, 9.7, 2.3), (5, 4.4, 0.037, 1.222), (6, -8.6, -7.8, 2.1233), (7, 5.1, 9.3, 8.1222), (8, 7.9, -3.6, 9.837), (9, -8.2, 0.62, 8.43555), (10, -3, 7.3, 6.762))
Result:
┌─covarPopMatrix────────────┐ │ [8.25,-1.76,4.08,6.748] │ │ [-1.76,41.07,6.486,2.132] │ │ [4.08,6.486,34.21,4.755] │ │ [6.748,2.132,4.755,9.93] │ └───────────────────────────┘
covarPopStable¶
Calculates the population covariance between two numeric variables using a numerically stable algorithm. This function is generally slower but provides more accurate results than covarPop.
Syntax¶
covarPopStable(x, y)
Arguments¶
x: The first numeric variable (Integer, Float, or Decimal).y: The second numeric variable (Integer, Float, or Decimal).
Returns¶
The population covariance as a Float64.
Example¶
SELECT covarPopStable(c2, c3) FROM values ((1, 5.6,-4.4),(2, -9.6,3),(3, -1.3,-4),(4, 5.3,9.7),(5, 4.4,0.037),(6, -8.6,-7.8),(7, 5.1,9.3),(8, 7.9,-3.6),(9, -8.2,0.62),(10, -3,7.3))
Result:
┌─covarPopStable(c2, c3)─┐ │ 6.485648 │ └────────────────────────┘
covarSamp¶
Calculates the sample covariance between two numeric variables. This function uses a numerically unstable algorithm.
Syntax¶
covarSamp(x, y)
Arguments¶
x: The first numeric variable (Integer, Float, or Decimal).y: The second numeric variable (Integer, Float, or Decimal).
Returns¶
The sample covariance as a Float64. Returns NaN if the sample size is 1 or less.
Example¶
SELECT covarSamp(c2, c3) FROM values ((1, 5.6,-4.4),(2, -9.6,3),(3, -1.3,-4),(4, 5.3,9.7),(5, 4.4,0.037),(6, -8.6,-7.8),(7, 5.1,9.3),(8, 7.9,-3.6),(9, -8.2,0.62),(10, -3,7.3))
Result:
┌─covarSamp(c2, c3)───┐ │ 7.206275555555556 │ └─────────────────────┘
covarSampMatrix¶
Computes the sample covariance matrix for a set of numeric variables.
Syntax¶
covarSampMatrix(x[, ...])
Arguments¶
x: One or more numeric variables (Integer, Float, or Decimal).
Returns¶
A nested array representing the sample covariance matrix. Array(Array(Float64)).
Example¶
SELECT arrayMap(x -> round(x, 3), arrayJoin(covarSampMatrix(c1, c2, c3, c4))) AS covarSampMatrix FROM values ((1, 5.6, -4.4, 2.6), (2, -9.6, 3, 3.3), (3, -1.3, -4, 1.2), (4, 5.3, 9.7, 2.3), (5, 4.4, 0.037, 1.222), (6, -8.6, -7.8, 2.1233), (7, 5.1, 9.3, 8.1222), (8, 7.9, -3.6, 9.837), (9, -8.2, 0.62, 8.43555), (10, -3, 7.3, 6.762))
Result:
┌─covarSampMatrix─────────────┐ │ [9.167,-1.956,4.534,7.498] │ │ [-1.956,45.634,7.206,2.369] │ │ [4.534,7.206,38.011,5.283] │ │ [7.498,2.369,5.283,11.034] │ └─────────────────────────────┘
covarSampStable¶
Calculates the sample covariance between two numeric variables using a numerically stable algorithm. This function is generally slower but provides more accurate results than covarSamp.
Syntax¶
covarSampStable(x, y)
Arguments¶
x: The first numeric variable (Integer, Float, or Decimal).y: The second numeric variable (Integer, Float, or Decimal).
Returns¶
The sample covariance as a Float64. Returns inf if the sample size is 1 or less.
Example¶
SELECT covarSampStable(c2, c3) FROM values ((1, 5.6,-4.4),(2, -9.6,3),(3, -1.3,-4),(4, 5.3,9.7),(5, 4.4,0.037),(6, -8.6,-7.8),(7, 5.1,9.3),(8, 7.9,-3.6),(9, -8.2,0.62),(10, -3,7.3))
Result:
┌─covarSampStable(x_value, y_value)─┐ │ 7.206275555555556 │ └───────────────────────────────────┘
cramersV¶
Calculates Cramer's V, a measure of association between two nominal variables. The value ranges from 0 (no association) to 1 (complete association).
Syntax¶
cramersV(column1, column2)
Arguments¶
column1: The first categorical column.column2: The second categorical column.
Returns¶
A Float64 value between 0 and 1.
Example¶
SELECT
cramersV(a, b)
FROM
(
SELECT
number % 10 AS a,
number % 5 AS b
FROM
numbers(150)
)
Result:
┌─────cramersV(a, b)─┐ │ 0.8944271909999159 │ └────────────────────┘
cramersVBiasCorrected¶
Calculates a bias-corrected version of Cramer's V, a measure of association between two nominal variables. This version aims to provide a more accurate estimate of the association, especially for smaller sample sizes.
Syntax¶
cramersVBiasCorrected(column1, column2)
Arguments¶
column1: The first categorical column.column2: The second categorical column.
Returns¶
A Float64 value between 0 and 1.
Example¶
SELECT
cramersV(a, b),
cramersVBiasCorrected(a ,b)
FROM
(
SELECT
number % 10 AS a,
number % 4 AS b
FROM
numbers(150)
)
Result:
┌──────cramersV(a, b)─┬─cramersVBiasCorrected(a, b)─┐ │ 0.41171788506213564 │ 0.33369281784141364 │ └─────────────────────┴─────────────────────────────┘
deltaSum¶
Calculates the sum of positive differences between consecutive values in a sequence. Negative differences are ignored. This function assumes the input data is already sorted.
Syntax¶
deltaSum(value)
Arguments¶
value: A numeric expression (Integer or Float).
Returns¶
The sum of positive differences, with a type matching the input value.
Example¶
SELECT deltaSum(arrayJoin([1, 2, 3, 0, 3, 4, 2, 3]))
Result:
┌─deltaSum(arrayJoin([1, 2, 3, 0, 3, 4, 2, 3]))─┐ │ 7 │ └───────────────────────────────────────────────┘
deltaSumTimestamp¶
Calculates the sum of positive differences between consecutive values, ordered by a specified timestamp. This function is useful for materialized views where rows might not be physically ordered by the value itself but by a timestamp.
Syntax¶
deltaSumTimestamp(value, timestamp)
Arguments¶
value: The numeric value (Integer, Float, Date, or DateTime).timestamp: The timestamp or ordering key (Integer, Float, Date, or DateTime).
Returns¶
The accumulated differences, with a type matching the value argument.
Example¶
SELECT deltaSumTimestamp(value, timestamp) FROM (SELECT number AS timestamp, [0, 4, 8, 3, 0, 0, 0, 1, 3, 5][number] AS value FROM numbers(1, 10))
Result:
┌─deltaSumTimestamp(value, timestamp)─┐ │ 13 │ └─────────────────────────────────────┘
entropy¶
Calculates the Shannon entropy of a column of values, which measures the average amount of information or uncertainty in the data.
Syntax¶
entropy(val)
Arguments¶
val: A column of values of any data type.
Returns¶
The Shannon entropy as a Float64.
Example¶
SELECT entropy(c1), entropy(c2) from values((1, 'A'), (1, 'A'), (1,'A'), (1,'A'), (2,'B'), (2,'B'), (2,'C'), (2,'D'))
Result:
┌─entropy(c1)─┬─entropy(c2)─┐ │ 1 │ 1.75 │ └─────────────┴─────────────┘
exponentialMovingAverage¶
Calculates the exponential moving average of a series of values, giving more weight to recent observations. The half-life parameter x determines the decay rate of older values.
Syntax¶
exponentialMovingAverage(x)(value, timeunit)
Parameters¶
x: The half-life period, a numeric value (Integer, Float, or Decimal).
Arguments¶
value: The numeric value (Integer, Float, or Decimal).timeunit: A numeric index representing time units (Integer, Float, or Decimal). This is not a timestamp but an index of the time interval.
Returns¶
The exponentially smoothed moving average at the latest point in time as a Float64.
Example¶
WITH data AS (
SELECT
number AS timestamp,
95 + (number % 5) AS temperature
FROM numbers(20)
)
SELECT exponentialMovingAverage(5)(temperature, timestamp) FROM data
Result:
┌─exponentialMovingAverage(5)(temperature, timestamp)──┐ │ 92.25779635374204 │ └──────────────────────────────────────────────────────┘
exponentialTimeDecayedAvg¶
Calculates the exponentially smoothed weighted moving average of a time series at a given point t, where older values have less influence.
Syntax¶
exponentialTimeDecayedAvg(x)(v, t)
Parameters¶
x: The half-life period, a numeric value (Integer, Float, or Decimal).
Arguments¶
v: The numeric value (Integer, Float, or Decimal).t: The time point (Integer, Float, Decimal, DateTime, or DateTime64).
Returns¶
The exponentially smoothed weighted moving average at time t as a Float64.
Example¶
SELECT
value,
time,
round(exp_smooth, 3) AS exp_smooth
FROM
(
SELECT
(number = 0) OR (number >= 25) AS value,
number AS time,
exponentialTimeDecayedAvg(10)(value, time) OVER (ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS exp_smooth
FROM numbers(5)
)
Result:
┌─value─┬─time─┬─exp_smooth─┐ │ 1 │ 0 │ 1 │ │ 0 │ 1 │ 0.475 │ │ 0 │ 2 │ 0.301 │ │ 0 │ 3 │ 0.214 │ │ 0 │ 4 │ 0.162 │ └───────┴──────┴────────────┘
exponentialTimeDecayedCount¶
Returns the cumulative exponential decay over a time series at a given point t. This function quantifies the total "presence" of events, with more recent events contributing more.
Syntax¶
exponentialTimeDecayedCount(x)(t)
Parameters¶
x: The half-life period, a numeric value (Integer, Float, or Decimal).
Arguments¶
t: The time point (Integer, Float, Decimal, DateTime, or DateTime64).
Returns¶
The cumulative exponential decay at the given time point as a Float64.
Example¶
SELECT
value,
time,
round(exp_smooth, 3) AS exp_smooth
FROM
(
SELECT
(number % 5) = 0 AS value,
number AS time,
exponentialTimeDecayedCount(10)(time) OVER (ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS exp_smooth
FROM numbers(5)
)
Result:
┌─value─┬─time─┬─exp_smooth─┐ │ 1 │ 0 │ 1 │ │ 0 │ 1 │ 1.905 │ │ 0 │ 2 │ 2.724 │ │ 0 │ 3 │ 3.464 │ │ 0 │ 4 │ 4.135 │ └───────┴──────┴────────────┘
exponentialTimeDecayedMax¶
Returns the maximum of the exponentially smoothed weighted moving average at time t compared to the value at t-1. This helps track the peak influence of recent events.
Syntax¶
exponentialTimeDecayedMax(x)(value, timeunit)
Parameters¶
x: The half-life period, a numeric value (Integer, Float, or Decimal).
Arguments¶
value: The numeric value (Integer, Float, or Decimal).timeunit: The time point (Integer, Float, Decimal, DateTime, or DateTime64).
Returns¶
The maximum of the exponentially smoothed weighted moving average at t and t-1 as a Float64.
Example¶
SELECT
value,
time,
round(exp_smooth, 3) AS exp_smooth
FROM
(
SELECT
(number = 0) OR (number >= 25) AS value,
number AS time,
exponentialTimeDecayedMax(10)(value, time) OVER (ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS exp_smooth
FROM numbers(5)
)
Result:
┌─value─┬─time─┬─exp_smooth─┐ │ 1 │ 0 │ 1 │ │ 0 │ 1 │ 0.905 │ │ 0 │ 2 │ 0.819 │ │ 0 │ 3 │ 0.741 │ │ 0 │ 4 │ 0.67 │ └───────┴──────┴────────────┘
exponentialTimeDecayedSum¶
Returns the sum of exponentially smoothed moving average values of a time series at a given point t. This function provides a weighted sum where recent values contribute more.
Syntax¶
exponentialTimeDecayedSum(x)(v, t)
Parameters¶
x: The half-life period, a numeric value (Integer, Float, or Decimal).
Arguments¶
v: The numeric value (Integer, Float, or Decimal).t: The time point (Integer, Float, Decimal, DateTime, or DateTime64).
Returns¶
The sum of exponentially smoothed moving average values at the given time point as a Float64.
Example¶
SELECT
value,
time,
round(exp_smooth, 3) AS exp_smooth
FROM
(
SELECT
(number = 0) OR (number >= 25) AS value,
number AS time,
exponentialTimeDecayedSum(10)(value, time) OVER (ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS exp_smooth
FROM numbers(5)
)
Result:
┌─value─┬─time─┬─exp_smooth─┐ │ 1 │ 0 │ 1 │ │ 0 │ 1 │ 0.905 │ │ 0 │ 2 │ 0.819 │ │ 0 │ 3 │ 0.741 │ │ 0 │ 4 │ 0.67 │ └───────┴──────┴────────────┘
first_value¶
Selects the first encountered value of a column. This function is an alias for any but explicitly supports the RESPECT NULLS modifier, making it suitable for window functions or when NULL values need to be considered.
Syntax¶
first_value(column) [RESPECT NULLS]
Arguments¶
column: The column from which to select the first value.
Returns¶
The first value encountered. If RESPECT NULLS is used, it returns the first value, even if it's NULL. Otherwise, it skips NULLs. The return type matches the input column's type.
Example¶
WITH cte AS (SELECT arrayJoin([NULL, 'Amsterdam', 'New York', 'Tokyo', 'Valencia', NULL]) as city) SELECT first_value(city), first_value(city) RESPECT NULLS FROM cte
Result:
┌─first_value(city)─┬─first_value(city) RESPECT NULLS─┐ │ Amsterdam │ ᴺᵁᴸᴸ │ └───────────────────┴─────────────────────────────────┘
groupArray¶
Creates an array containing all values of an argument within a group. The order of elements in the resulting array is non-deterministic.
Syntax¶
groupArray(x) groupArray(max_size)(x)
Alias: array_agg
Arguments¶
x: The expression or column whose values are to be collected into an array.max_size: An optionalUInt64parameter that limits the size of the resulting array.
Returns¶
An array containing the collected values. NULL values are excluded from the array. Array(T).
Example¶
WITH data AS (SELECT number % 3 AS id, toString(number) AS name FROM numbers(5)) SELECT id, groupArray(name) FROM data GROUP BY id ORDER BY id
Result:
┌─id─┬─groupArray(name)─┐ │ 0 │ ['0','3'] │ │ 1 │ ['1','4'] │ │ 2 │ ['2'] │ └────┴──────────────────┘
groupArrayInsertAt¶
Inserts a value into an array at a specified position. If multiple values are inserted into the same position within a single-threaded query, the first value is used. In multi-threaded queries, the result is non-deterministic.
Syntax¶
groupArrayInsertAt(default_x, size)(x, pos)
Arguments¶
x: The value to insert.pos: TheUInt32index wherexshould be inserted (0-based).default_x: An optional expression for the default value to fill empty positions. If omitted, the default value for the data type ofxis used.size: An optionalUInt32parameter specifying the fixed length of the resulting array. Requiresdefault_xto be specified.
Returns¶
An array with the value inserted at the specified position. Array(T).
Example¶
SELECT groupArrayInsertAt('-', 5)(toString(number), number * 2) FROM numbers(3)
Result:
┌─groupArrayInsertAt('-', 5)(toString(number), multiply(number, 2))─┐
│ ['0','-','1','-','2'] │
└───────────────────────────────────────────────────────────────────┘
groupArrayIntersect¶
Returns an array containing elements that are present in all input arrays within a group (the intersection of arrays).
Syntax¶
groupArrayIntersect(x)
Arguments¶
x: An array expression or column.
Returns¶
An array containing only the common elements found in all input arrays. Array(T).
Example¶
WITH data AS (
SELECT arrayJoin([
[1,2,4],
[1,5,2,8,-1,0],
[1,5,7,5,8,2]
]) AS a
)
SELECT groupArrayIntersect(a) AS intersection FROM data
Result:
┌─intersection─┐ │ [1,2] │ └──────────────┘
groupArrayLast¶
Creates an array containing the last encountered values of an argument within a group. The order of elements in the resulting array is non-deterministic unless the data is explicitly ordered.
Syntax¶
groupArrayLast(max_size)(x)
Arguments¶
max_size: An optionalUInt64parameter that limits the size of the resulting array.x: The expression or column whose values are to be collected into an array.
Returns¶
An array containing the last collected values. Array(T).
Example¶
SELECT groupArrayLast(2)(number+1) AS numbers FROM numbers(10)
Result:
┌─numbers─┐ │ [9,10] │ └─────────┘
groupArrayMovingAvg¶
Calculates the moving average of numeric input values over a specified window.
Syntax¶
groupArrayMovingAvg(numbers_for_summing) groupArrayMovingAvg(window_size)(numbers_for_summing)
Arguments¶
numbers_for_summing: A numeric expression.window_size: An optionalUInt64parameter defining the size of the moving window. If omitted, the window size defaults to the total number of rows.
Returns¶
An array of the same size and type as the input, where each element is the moving average. Array(T).
Example¶
SELECT groupArrayMovingAvg(2)(number) FROM numbers(5)
Result:
┌─groupArrayMovingAvg(2)(number)─┐ │ [0,0,1,2,3] │ └────────────────────────────────┘
groupArrayMovingSum¶
Calculates the moving sum of numeric input values over a specified window.
Syntax¶
groupArrayMovingSum(numbers_for_summing) groupArrayMovingSum(window_size)(numbers_for_summing)
Arguments¶
numbers_for_summing: A numeric expression.window_size: An optionalUInt64parameter defining the size of the moving window. If omitted, the window size defaults to the total number of rows.
Returns¶
An array of the same size and type as the input, where each element is the moving sum. Array(T).
Example¶
SELECT groupArrayMovingSum(2)(number) FROM numbers(5)
Result:
┌─groupArrayMovingSum(2)(number)─┐ │ [0,1,3,5,7] │ └────────────────────────────────┘
groupArraySample¶
Creates an array of randomly sampled argument values. The size of the resulting array is limited to max_size elements.
Syntax¶
groupArraySample(max_size[, seed])(x)
Arguments¶
max_size: The maximumUInt64size of the resulting array.seed: An optionalUInt64seed for the random number generator. Default:123456.x: The expression or column whose values are to be sampled.
Returns¶
An array of randomly selected x arguments. Array(T).
Example¶
WITH colors AS (SELECT arrayJoin(['red', 'blue', 'green', 'white', 'orange']) AS color) SELECT groupArraySample(3)(color) AS newcolors FROM colors
Result:
┌─newcolors──────────────────┐ │ ['white','blue','green'] │ └────────────────────────────┘
groupArraySorted¶
Returns an array containing the first N items from a group, sorted in ascending order.
Syntax¶
groupArraySorted(N)(column)
Arguments¶
N: TheUInt64number of elements to return.column: The column whose values are to be collected and sorted.
Returns¶
An array of the first N sorted items. Array(T).
Example¶
SELECT groupArraySorted(5)(number) FROM numbers(10)
Result:
┌─groupArraySorted(5)(number)─┐ │ [0,1,2,3,4] │ └─────────────────────────────┘
groupBitAnd¶
Applies a bit-wise AND operation to all numeric values within a group.
Syntax¶
groupBitAnd(expr)
Arguments¶
expr: A numeric expression (UInt* or Int*).
Returns¶
The result of the bit-wise AND operation, with a type matching expr.
Example¶
WITH data AS (SELECT arrayJoin([44, 28, 13, 85]) AS num) SELECT groupBitAnd(num) FROM data
Result:
┌─groupBitAnd(num)─┐ │ 4 │ └──────────────────┘
groupBitmap¶
Creates a bitmap from a column of unsigned integers and returns its cardinality (number of unique elements).
Syntax¶
groupBitmap(expr)
Arguments¶
expr: An unsigned integer expression (UInt*).
Returns¶
The cardinality of the bitmap as a UInt64.
Example¶
WITH data AS (SELECT arrayJoin([1, 1, 2, 3]) AS UserID) SELECT groupBitmap(UserID) AS num FROM data
Result:
┌─num─┐ │ 3 │ └─────┘
groupBitmapAnd¶
Calculates the cardinality of the bit-wise AND operation across multiple bitmap states.
Syntax¶
groupBitmapAnd(expr)
Arguments¶
expr: An expression that results in anAggregateFunction(groupBitmap, UInt*)state.
Returns¶
The cardinality of the resulting bitmap as a UInt64.
Example¶
WITH data AS ( SELECT groupBitmapState(number) AS bm FROM numbers(5) WHERE number IN (0,1,2) UNION ALL SELECT groupBitmapState(number) AS bm FROM numbers(5) WHERE number IN (1,2,3) ) SELECT groupBitmapAnd(bm) FROM data
Result:
┌─groupBitmapAnd(bm)─┐ │ 2 │ └────────────────────┘
groupBitmapOr¶
Calculates the cardinality of the bit-wise OR operation across multiple bitmap states. This is equivalent to groupBitmapMerge.
Syntax¶
groupBitmapOr(expr)
Arguments¶
expr: An expression that results in anAggregateFunction(groupBitmap, UInt*)state.
Returns¶
The cardinality of the resulting bitmap as a UInt64.
Example¶
WITH data AS ( SELECT groupBitmapState(number) AS bm FROM numbers(5) WHERE number IN (0,1,2) UNION ALL SELECT groupBitmapState(number) AS bm FROM numbers(5) WHERE number IN (1,2,3) ) SELECT groupBitmapOr(bm) FROM data
Result:
┌─groupBitmapOr(bm)─┐ │ 4 │ └───────────────────┘
groupBitmapXor¶
Calculates the cardinality of the bit-wise XOR operation across multiple bitmap states.
Syntax¶
groupBitmapXor(expr)
Arguments¶
expr: An expression that results in anAggregateFunction(groupBitmap, UInt*)state.
Returns¶
The cardinality of the resulting bitmap as a UInt64.
Example¶
WITH data AS ( SELECT groupBitmapState(number) AS bm FROM numbers(5) WHERE number IN (0,1,2) UNION ALL SELECT groupBitmapState(number) AS bm FROM numbers(5) WHERE number IN (1,2,3) ) SELECT groupBitmapXor(bm) FROM data
Result:
┌─groupBitmapXor(bm)─┐ │ 2 │ └────────────────────┘
groupBitOr¶
Applies a bit-wise OR operation to all numeric values within a group.
Syntax¶
groupBitOr(expr)
Arguments¶
expr: A numeric expression (UInt* or Int*).
Returns¶
The result of the bit-wise OR operation, with a type matching expr.
Example¶
WITH data AS (SELECT arrayJoin([44, 28, 13, 85]) AS num) SELECT groupBitOr(num) FROM data
Result:
┌─groupBitOr(num)─┐ │ 125 │ └─────────────────┘
groupBitXor¶
Applies a bit-wise XOR operation to all numeric values within a group.
Syntax¶
groupBitXor(expr)
Arguments¶
expr: A numeric expression (UInt* or Int*).
Returns¶
The result of the bit-wise XOR operation, with a type matching expr.
Example¶
WITH data AS (SELECT arrayJoin([44, 28, 13, 85]) AS num) SELECT groupBitXor(num) FROM data
Result:
┌─groupBitXor(num)─┐ │ 104 │ └──────────────────┘
groupConcat¶
Concatenates strings from a group into a single string, with optional delimiter and limit.
Syntax¶
groupConcat[(delimiter [, limit])](expression)
Arguments¶
expression: The string expression or column to concatenate.delimiter: An optionalStringto separate concatenated values. Defaults to an empty string.limit: An optionalUInt64to specify the maximum number of elements to concatenate.
Returns¶
A String containing the concatenated values. If the group is empty or contains only NULLs, it returns NULL (if the function doesn't specify NULL handling).
Example¶
WITH Employees AS (SELECT arrayJoin(['John', 'Jane', 'Bob']) AS Name)
SELECT groupConcat(', ', 2)(Name) FROM Employees
Result:
┌─groupConcat(', ', 2)(Name)─┐
│ John, Jane │
└────────────────────────────┘
groupUniqArray¶
Creates an array containing unique values of an argument within a group. Memory consumption is similar to uniqExact.
Syntax¶
groupUniqArray(x) groupUniqArray(max_size)(x)
Arguments¶
x: The expression or column whose unique values are to be collected into an array.max_size: An optionalUInt64parameter that limits the size of the resulting array.
Returns¶
An array containing the unique collected values. Array(T).
Example¶
WITH data AS (SELECT arrayJoin([1, 2, 2, 3, 1, 4]) AS num) SELECT groupUniqArray(num) FROM data
Result:
┌─groupUniqArray(num)─┐ │ [1,2,3,4] │ └─────────────────────┘
intervalLengthSum¶
Calculates the total length of the union of all intervals (segments on a numeric axis) defined by start and end points.
Syntax¶
intervalLengthSum(start, end)
Arguments¶
start: The starting value of the interval (numeric, Date, or DateTime).end: The ending value of the interval (numeric, Date, or DateTime). Arguments must be of the same data type.
Returns¶
The total length of the union of intervals. UInt64 for integer/date types, Float64 for float types.
Example¶
WITH fl_interval AS ( SELECT 'a' AS id, arrayJoin([(1.1, 2.9), (2.5, 3.2), (4.0, 5.0)]) AS interval_pair ) SELECT id, intervalLengthSum(interval_pair.1, interval_pair.2) AS total_length FROM fl_interval GROUP BY id
Result:
┌─id─┬─total_length─┐ │ a │ 3.1 │ └────┴──────────────┘
kolmogorovSmirnovTest¶
Applies the Kolmogorov-Smirnov test to compare two samples and determine if they come from the same continuous probability distribution.
Syntax¶
kolmogorovSmirnovTest([alternative, computation_method])(sample_data, sample_index)
Parameters¶
alternative: OptionalStringspecifying the alternative hypothesis ('two-sided','greater','less'). Default:'two-sided'.computation_method: OptionalStringspecifying the p-value computation method ('exact','asymp','auto'). Default:'auto'.
Arguments¶
sample_data: The numeric values from both samples (Integer, Float, or Decimal).sample_index: AnInteger(0 or 1) indicating which sample thesample_databelongs to.
Returns¶
A tuple containing the calculated statistic and the p-value. Tuple(Float64, Float64).
Example¶
SELECT kolmogorovSmirnovTest('less', 'exact')(value, num)
FROM
(
SELECT
randNormal(0, 10) AS value,
0 AS num
FROM numbers(100)
UNION ALL
SELECT
randNormal(0, 10) AS value,
1 AS num
FROM numbers(100)
)
Result:
┌─kolmogorovSmirnovTest('less', 'exact')(value, num)─┐
│ (0.009999999999999998,0.37528595205132287) │
└────────────────────────────────────────────────────┘
kurtPop¶
Computes the population kurtosis of a sequence of numbers, which measures the "tailedness" of the probability distribution.
Syntax¶
kurtPop(expr)
Arguments¶
expr: A numeric expression.
Returns¶
The population kurtosis as a Float64.
Example¶
SELECT kurtPop(number) FROM numbers(10)
Result:
┌─kurtPop(number)─┐ │ -1.224 │ └─────────────────┘
kurtSamp¶
Computes the sample kurtosis of a sequence of numbers, providing an unbiased estimate of the kurtosis of a random variable from a sample.
Syntax¶
kurtSamp(expr)
Arguments¶
expr: A numeric expression.
Returns¶
The sample kurtosis as a Float64. Returns NaN if the sample size is 1 or less.
Example¶
SELECT kurtSamp(number) FROM numbers(10)
Result:
┌─kurtSamp(number)─┐ │ -1.224242424242424 │ └──────────────────┘
largestTriangleThreeBuckets¶
Applies the Largest-Triangle-Three-Buckets (LTTB) algorithm to downsample time series data for visualization. It selects N points that best represent the original series, preserving visual trends.
Syntax¶
largestTriangleThreeBuckets(n)(x, y)
Alias: lttb
Parameters¶
n: TheUInt64number of points to return in the downsampled series.
Arguments¶
x: The x-coordinate (Integer, Float, Decimal, Date, Date32, DateTime, or DateTime64).y: The y-coordinate (Integer, Float, Decimal, Date, Date32, DateTime, or DateTime64).NaNvalues in the series are ignored.
Returns¶
An array of tuples, where each tuple contains the x and y coordinates of the selected points. Array(Tuple(T_x, T_y)).
Example¶
WITH data AS (
SELECT arrayJoin([
(1.0, 10.0), (2.0, 20.0), (3.0, 15.0), (8.0, 60.0),
(9.0, 55.0), (10.0, 70.0), (4.0, 30.0), (5.0, 40.0),
(6.0, 35.0), (7.0, 50.0)
]) AS point
)
SELECT largestTriangleThreeBuckets(4)(point.1, point.2) FROM data
Result:
┌─largestTriangleThreeBuckets(4)(point.1, point.2)─┐ │ [(1,10),(3,15),(9,55),(10,70)] │ └──────────────────────────────────────────────────┘
last_value¶
Selects the last encountered value of a column. This function is an alias for anyLast but explicitly supports the RESPECT NULLS modifier, making it suitable for window functions or when NULL values need to be considered.
Syntax¶
last_value(column) [RESPECT NULLS]
Arguments¶
column: The column from which to select the last value.
Returns¶
The last value encountered. If RESPECT NULLS is used, it returns the last value, even if it's NULL. Otherwise, it skips NULLs. The return type matches the input column's type.
Example¶
WITH cte AS (SELECT arrayJoin([NULL, 'Amsterdam', 'New York', 'Tokyo', 'Valencia', NULL]) as city) SELECT last_value(city), last_value(city) RESPECT NULLS FROM cte
Result:
┌─last_value(city)─┬─last_value(city) RESPECT NULLS─┐ │ Valencia │ ᴺᵁᴸᴸ │ └──────────────────┴────────────────────────────────┘
mannWhitneyUTest¶
Applies the Mann-Whitney U-test (also known as Wilcoxon rank-sum test) to compare two independent samples and determine if they come from the same distribution. This non-parametric test does not assume normal distribution.
Syntax¶
mannWhitneyUTest[(alternative[, continuity_correction])](sample_data, sample_index)
Parameters¶
alternative: OptionalStringspecifying the alternative hypothesis ('two-sided','greater','less'). Default:'two-sided'.continuity_correction: OptionalUInt64(0 or 1) to apply continuity correction in the normal approximation for the p-value. Default: 1.
Arguments¶
sample_data: The numeric values from both samples (Integer, Float, or Decimal).sample_index: AnInteger(0 or 1) indicating which sample thesample_databelongs to.
Returns¶
A tuple containing the calculated U-statistic and the p-value. Tuple(Float64, Float64).
Example¶
WITH mww_ttest AS (
SELECT arrayJoin([(10,0), (11,0), (12,0), (1,1), (2,1), (3,1)]) AS data_pair
)
SELECT mannWhitneyUTest('greater')(data_pair.1, data_pair.2) FROM mww_ttest
Result:
┌─mannWhitneyUTest('greater')(data_pair.1, data_pair.2)─┐
│ (9,0.04042779918503192) │
└───────────────────────────────────────────────────────┘
max¶
Calculates the maximum value within a group.
Syntax¶
max(expr)
Arguments¶
expr: Any comparable expression.
Returns¶
The maximum value. The return type matches the expr type.
Example¶
SELECT max(number) FROM numbers(10)
Result:
┌─max(number)─┐ │ 9 │ └─────────────┘
maxIntersections¶
Calculates the maximum number of times that a group of intervals (defined by start and end points) overlap each other.
Syntax¶
maxIntersections(start_column, end_column)
Arguments¶
start_column: A numeric column representing the start of each interval.NULLor 0 values are skipped.end_column: A numeric column representing the end of each interval.NULLor 0 values are skipped.
Returns¶
The maximum number of intersected intervals as a UInt64.
Example¶
WITH intervals AS ( SELECT arrayJoin([(1, 5), (3, 7), (6, 10), (2, 4)]) AS interval_pair ) SELECT maxIntersections(interval_pair.1, interval_pair.2) FROM intervals
Result:
┌─maxIntersections(interval_pair.1, interval_pair.2)─┐ │ 3 │ └────────────────────────────────────────────────────┘
maxIntersectionsPosition¶
Calculates the starting positions where the maximum number of interval overlaps occur.
Syntax¶
maxIntersectionsPosition(start_column, end_column)
Arguments¶
start_column: A numeric column representing the start of each interval.NULLor 0 values are skipped.end_column: A numeric column representing the end of each interval.NULLor 0 values are skipped.
Returns¶
An array of starting positions where the maximum number of intersections occur. Array(T).
Example¶
WITH intervals AS ( SELECT arrayJoin([(1, 5), (3, 7), (6, 10), (2, 4)]) AS interval_pair ) SELECT maxIntersectionsPosition(interval_pair.1, interval_pair.2) FROM intervals
Result:
┌─maxIntersectionsPosition(interval_pair.1, interval_pair.2)─┐ │ [3] │ └────────────────────────────────────────────────────────────┘
maxMap¶
Calculates the maximum value for each key across multiple maps within a group. It returns a tuple of two arrays: sorted keys and their corresponding maximum values.
Syntax¶
maxMap(key, value) maxMap(Tuple(key, value))
Alias: maxMappedArrays
Arguments¶
key: An array of keys.value: An array of values. The number of elements inkeyandvaluearrays must be the same for each row.
Returns¶
A tuple containing two arrays: sorted keys and their maximum values. Tuple(Array(T_key), Array(T_value)).
Example¶
SELECT maxMap(a, b)
FROM values('a Array(String), b Array(Int64)', (['x', 'y'], [2, 2]), (['y', 'z'], [3, 1]))
Result:
┌─maxMap(a, b)───────────┐ │ [['x','y','z'],[2,3,1]]│ └────────────────────────┘
meanZTest¶
Applies a Z-test for means to compare two populations, assuming normal distribution and known population variances. It determines if the means of the populations are equal.
Syntax¶
meanZTest(population_variance_x, population_variance_y, confidence_level)(sample_data, sample_index)
Parameters¶
population_variance_x: The known variance of the first population (Float).population_variance_y: The known variance of the second population (Float).confidence_level: TheFloatconfidence level for calculating confidence intervals.
Arguments¶
sample_data: The numeric values from both samples (Integer, Float, or Decimal).sample_index: AnInteger(0 or 1) indicating which sample thesample_databelongs to.
Returns¶
A tuple containing the calculated t-statistic, p-value, and the lower and upper bounds of the confidence interval. Tuple(Float64, Float64, Float64, Float64).
Example¶
WITH mean_ztest AS ( SELECT arrayJoin([(20.3,0), (21.9,0), (22.1,0), (18.9,1), (19.0,1), (20.3,1)]) AS data_pair ) SELECT meanZTest(0.7, 0.45, 0.95)(data_pair.1, data_pair.2) FROM mean_ztest
Result:
┌─meanZTest(0.7, 0.45, 0.95)(data_pair.1, data_pair.2)────────────────────────────┐ │ (3.2841296025548123,0.0010229786769086013,0.8198428246768334,3.2468238419898365) │ └──────────────────────────────────────────────────────────────────────────────────┘
median¶
Calculates the median of a numeric data sample. This function is an alias for quantile.
Syntax¶
median(expr)
Aliases: medianDeterministic, medianExact, medianExactWeighted, medianTiming, medianTimingWeighted, medianTDigest, medianTDigestWeighted, medianBFloat16, medianDD. These are aliases for their respective quantile* functions.
Arguments¶
expr: A numeric expression (Integer, Float, Decimal, Date, or DateTime).
Returns¶
The approximate median of the data. The return type depends on the input expr type (Float64 for numeric, Date for Date, DateTime for DateTime).
Example¶
SELECT median(number) FROM numbers(5)
Result:
┌─median(number)─┐ │ 2.5 │ └────────────────┘
min¶
Calculates the minimum value within a group.
Syntax¶
min(expr)
Arguments¶
expr: Any comparable expression.
Returns¶
The minimum value. The return type matches the expr type.
Example¶
SELECT min(number) FROM numbers(10)
Result:
┌─min(number)─┐ │ 0 │ └─────────────┘
minMap¶
Calculates the minimum value for each key across multiple maps within a group. It returns a tuple of two arrays: sorted keys and their corresponding minimum values.
Syntax¶
minMap(key, value) minMap(Tuple(key, value))
Alias: minMappedArrays
Arguments¶
key: An array of keys.value: An array of values. The number of elements inkeyandvaluearrays must be the same for each row.
Returns¶
A tuple containing two arrays: sorted keys and their minimum values. Tuple(Array(T_key), Array(T_value)).
Example¶
SELECT minMap(a, b)
FROM values('a Array(Int32), b Array(Int64)', ([1, 2], [2, 2]), ([2, 3], [1, 1]))
Result:
┌─minMap(a, b)──────┐ │ ([1,2,3],[2,1,1]) │ └───────────────────┘
quantile¶
Computes an approximate quantile of a numeric data sequence using reservoir sampling. The result is non-deterministic.
Syntax¶
quantile(level)(expr)
Alias: median
Arguments¶
level: An optionalFloat64parameter (0 to 1) specifying the quantile level. Default: 0.5 (median).expr: A numeric expression (Integer, Float, Decimal, Date, or DateTime).
Returns¶
The approximate quantile of the specified level. Float64 for numeric input, Date for Date, DateTime for DateTime. Returns NaN for an empty numeric sequence.
Example¶
SELECT quantile(number) FROM numbers(5)
Result:
┌─quantile(number)─┐ │ 2.5 │ └────────────────┘
quantileBFloat16¶
Computes an approximate quantile of a numeric data sequence, optimized for bfloat16 numbers. It offers fast estimation with a relative error of no more than 0.390625%.
Syntax¶
quantileBFloat16[(level)](expr)
Alias: medianBFloat16
Arguments¶
level: An optionalFloat64parameter (0 to 1) specifying the quantile level. Default: 0.5.expr: A numeric expression (Integer or Float).
Returns¶
The approximate quantile of the specified level as a Float64.
Example¶
SELECT quantileBFloat16(0.75)(number) FROM numbers(5)
Result:
┌─quantileBFloat16(0.75)(number)─┐ │ 3 │ └────────────────────────────────┘
quantileBFloat16Weighted¶
Computes an approximate quantile of a numeric data sequence, considering the weight of each element, optimized for bfloat16 numbers.
Syntax¶
quantileBFloat16Weighted[(level)](expr, weight)
Arguments¶
level: An optionalFloat64parameter (0 to 1) specifying the quantile level. Default: 0.5.expr: A numeric expression (Integer or Float).weight: A numeric expression representing the weight of eachexprvalue.
Returns¶
The approximate quantile of the specified level as a Float64.
Example¶
SELECT quantileBFloat16Weighted(0.75)(number, 1) FROM numbers(5)
Result:
┌─quantileBFloat16Weighted(0.75)(number, 1)─┐ │ 3 │ └───────────────────────────────────────────┘
quantileDD¶
Computes an approximate quantile of a numeric data sequence with relative-error guarantees. The accuracy can be controlled by a parameter.
Syntax¶
quantileDD(relative_accuracy, [level])(expr)
Alias: medianDD
Parameters¶
relative_accuracy: AFloat64(0 to 1) specifying the desired relative accuracy. Recommended: 0.001 or higher.level: An optionalFloat64parameter (0 to 1) specifying the quantile level. Default: 0.5.
Arguments¶
expr: A numeric expression (Integer or Float).
Returns¶
The approximate quantile of the specified level as a Float64.
Example¶
SELECT quantileDD(0.01, 0.75)(number) FROM numbers(5)
Result:
┌─quantileDD(0.01, 0.75)(number)─┐ │ 3.0 │ └────────────────────────────────┘
quantileDeterministic¶
Computes an approximate quantile of a numeric data sequence using a deterministic reservoir sampling algorithm. The result is consistent across executions.
Syntax¶
quantileDeterministic(level)(expr, determinator)
Alias: medianDeterministic
Arguments¶
level: An optionalFloat64parameter (0 to 1) specifying the quantile level. Default: 0.5.expr: A numeric expression (Integer, Float, Decimal, Date, or DateTime).determinator: A positive numeric value whose hash is used for deterministic sampling.
Returns¶
The approximate quantile of the specified level. Float64 for numeric input, Date for Date, DateTime for DateTime.
Example¶
SELECT quantileDeterministic(number, 1) FROM numbers(5)
Result:
┌─quantileDeterministic(number, 1)─┐ │ 2.5 │ └──────────────────────────────────┘
quantileExact¶
Exactly computes the quantile of a numeric data sequence by collecting all values into an array and partially sorting it. This function consumes O(n) memory.
Syntax¶
quantileExact(level)(expr)
Alias: medianExact
Arguments¶
level: An optionalFloat64parameter (0 to 1) specifying the quantile level. Default: 0.5.expr: A numeric expression (Integer, Float, Decimal, Date, or DateTime).
Returns¶
The exact quantile of the specified level. Float64 for numeric input, Date for Date, DateTime for DateTime.
Example¶
SELECT quantileExact(number) FROM numbers(10)
Result:
┌─quantileExact(number)─┐ │ 5 │ └───────────────────────┘
quantileExactLow¶
Exactly computes the quantile of a numeric data sequence. For a level of 0.5, it returns the lower median value for an even number of elements and the middle median for an odd number.
Syntax¶
quantileExactLow(level)(expr)
Alias: medianExactLow
Arguments¶
level: An optionalFloat64parameter (0 to 1) specifying the quantile level. Default: 0.5.expr: A numeric expression (Integer, Float, Decimal, Date, or DateTime).
Returns¶
The exact quantile of the specified level. Float64 for numeric input, Date for Date, DateTime for DateTime.
Example¶
SELECT quantileExactLow(number) FROM numbers(10)
Result:
┌─quantileExactLow(number)─┐ │ 4 │ └──────────────────────────┘
quantileExactHigh¶
Exactly computes the quantile of a numeric data sequence. For a level of 0.5, it returns the higher median value for an even number of elements and the middle median for an odd number.
Syntax¶
quantileExactHigh(level)(expr)
Alias: medianExactHigh
Arguments¶
level: An optionalFloat64parameter (0 to 1) specifying the quantile level. Default: 0.5.expr: A numeric expression (Integer, Float, Decimal, Date, or DateTime).
Returns¶
The exact quantile of the specified level. Float64 for numeric input, Date for Date, DateTime for DateTime.
Example¶
SELECT quantileExactHigh(number) FROM numbers(10)
Result:
┌─quantileExactHigh(number)─┐ │ 5 │ └───────────────────────────┘
quantileExactExclusive¶
Exactly computes the quantile of a numeric data sequence, equivalent to the PERCENTILE.EXC Excel function (type R6). The level parameter excludes 0 and 1.
Syntax¶
quantileExactExclusive(level)(expr)
Arguments¶
level: An optionalFloat64parameter (0 to 1, exclusive) specifying the quantile level. Default: 0.5.expr: A numeric expression (Integer, Float, Decimal, Date, or DateTime).
Returns¶
The exact quantile of the specified level. Float64 for numeric input, Date for Date, DateTime for DateTime.
Example¶
SELECT quantileExactExclusive(0.5)(number) FROM numbers(10)
Result:
┌─quantileExactExclusive(0.5)(number)─┐ │ 4.5 │ └─────────────────────────────────────┘
quantileExactInclusive¶
Exactly computes the quantile of a numeric data sequence, equivalent to the PERCENTILE.INC Excel function (type R7). The level parameter includes 0 and 1.
Syntax¶
quantileExactInclusive(level)(expr)
Arguments¶
level: An optionalFloat64parameter (0 to 1, inclusive) specifying the quantile level. Default: 0.5.expr: A numeric expression (Integer, Float, Decimal, Date, or DateTime).
Returns¶
The exact quantile of the specified level. Float64 for numeric input, Date for Date, DateTime for DateTime.
Example¶
SELECT quantileExactInclusive(0.5)(number) FROM numbers(10)
Result:
┌─quantileExactInclusive(0.5)(number)─┐ │ 5 │ └─────────────────────────────────────┘
quantileExactWeighted¶
Exactly computes the quantile of a numeric data sequence, taking into account the weight of each element. It uses a hash table for efficiency, especially with repeated values.
Syntax¶
quantileExactWeighted(level)(expr, weight)
Alias: medianExactWeighted
Arguments¶
level: An optionalFloat64parameter (0 to 1) specifying the quantile level. Default: 0.5.expr: A numeric expression (Integer, Float, Decimal, Date, or DateTime).weight: AUInt64column representing the weight (number of occurrences) of eachexprvalue.
Returns¶
The exact quantile of the specified level. Float64 for numeric input, Date for Date, DateTime for DateTime.
Example¶
WITH data AS (SELECT arrayJoin([(0,3), (1,2), (2,1), (5,4)]) AS pair) SELECT quantileExactWeighted(pair.1, pair.2) FROM data
Result:
┌─quantileExactWeighted(pair.1, pair.2)─┐ │ 1 │ └───────────────────────────────────────┘
quantileExactWeightedInterpolated¶
Computes the quantile of a numeric data sequence using linear interpolation and considering the weight of each element. It builds a cumulative distribution based on weights.
Syntax¶
quantileExactWeightedInterpolated(level)(expr, weight)
Alias: medianExactWeightedInterpolated
Arguments¶
level: An optionalFloat64parameter (0 to 1) specifying the quantile level. Default: 0.5.expr: A numeric expression (Integer, Float, Decimal, Date, or DateTime).weight: AUInt64column representing the weight (number of occurrences) of eachexprvalue.
Returns¶
The interpolated quantile of the specified level. Float64 for numeric input, Date for Date, DateTime for DateTime.
Example¶
WITH data AS (SELECT arrayJoin([(0,3), (1,2), (2,1), (5,4)]) AS pair) SELECT quantileExactWeightedInterpolated(pair.1, pair.2) FROM data
Result:
┌─quantileExactWeightedInterpolated(pair.1, pair.2)─┐ │ 1.5 │ └───────────────────────────────────────────────────┘
quantileGK¶
Computes an approximate quantile of a numeric data sequence using the Greenwald-Khanna algorithm, which is efficient for data streams and allows control over accuracy.
Syntax¶
quantileGK(accuracy, level)(expr)
Alias: medianGK
Arguments¶
accuracy: A positiveUInt64integer controlling the accuracy. Higher values mean less error but more resources.level: An optionalFloat64parameter (0 to 1) specifying the quantile level. Default: 0.5.expr: A numeric expression (Integer, Float, Decimal, Date, or DateTime).
Returns¶
The approximate quantile of the specified level and accuracy. Float64 for numeric input, Date for Date, DateTime for DateTime.
Example¶
SELECT quantileGK(100, 0.25)(number + 1) FROM numbers(1000)
Result:
┌─quantileGK(100, 0.25)(plus(number, 1))─┐ │ 251 │ └────────────────────────────────────────┘
quantileInterpolatedWeighted¶
Computes the quantile of a numeric data sequence using linear interpolation and considering the weight of each element. It sorts values by weight and performs interpolation.
Syntax¶
quantileInterpolatedWeighted(level)(expr, weight)
Alias: medianInterpolatedWeighted
Arguments¶
level: An optionalFloat64parameter (0 to 1) specifying the quantile level. Default: 0.5.expr: A numeric expression (Integer, Float, Decimal, Date, or DateTime).weight: A numeric column representing the weight of eachexprvalue.
Returns¶
The interpolated quantile of the specified level. Float64 for numeric input, Date for Date, DateTime for DateTime.
Example¶
WITH data AS (SELECT arrayJoin([(0,3), (1,2), (2,1), (5,4)]) AS pair) SELECT quantileInterpolatedWeighted(pair.1, pair.2) FROM data
Result:
┌─quantileInterpolatedWeighted(pair.1, pair.2)─┐ │ 1 │ └──────────────────────────────────────────────┘
quantiles¶
This is a family of functions that calculate multiple quantiles simultaneously for a given data sequence. They return an array of results for the specified levels.
Syntax¶
quantiles(level1, level2, ...)(x)
This general syntax applies to specific implementations like quantilesDeterministic, quantilesTiming, quantilesTimingWeighted, quantilesExact, quantilesExactWeighted, quantileExactWeightedInterpolated, quantileInterpolatedWeighted, quantilesTDigest, quantilesBFloat16, quantilesDD.
Arguments¶
level1, level2, ...: One or moreFloat64parameters (0 to 1) specifying the quantile levels.x: A numeric expression (Integer, Float, Decimal, Date, or DateTime).
Returns¶
An array of quantiles for the specified levels. Array(T).
Example¶
SELECT quantiles(0.25, 0.5, 0.75)(number) FROM numbers(10)
Result:
┌─quantiles(0.25, 0.5, 0.75)(number)─┐ │ [2.5,5,7.5] │ └────────────────────────────────────┘
quantilesExactExclusive¶
Exactly computes multiple quantiles of a numeric data sequence, equivalent to the PERCENTILE.EXC Excel function (type R6). It is more efficient for sets of levels than calling quantileExactExclusive multiple times.
Syntax¶
quantilesExactExclusive(level1, level2, ...)(expr)
Arguments¶
level1, level2, ...: One or moreFloat64parameters (0 to 1, exclusive) specifying the quantile levels.expr: A numeric expression (Integer, Float, Decimal, Date, or DateTime).
Returns¶
An array of exact quantiles for the specified levels. Array(T).
Example¶
SELECT quantilesExactExclusive(0.25, 0.5, 0.75)(number) FROM numbers(10)
Result:
┌─quantilesExactExclusive(0.25, 0.5, 0.75)(number)─┐ │ [2.25,4.5,6.75] │ └──────────────────────────────────────────────────┘
quantilesExactInclusive¶
Exactly computes multiple quantiles of a numeric data sequence, equivalent to the PERCENTILE.INC Excel function (type R7). It is more efficient for sets of levels than calling quantileExactInclusive multiple times.
Syntax¶
quantilesExactInclusive(level1, level2, ...)(expr)
Arguments¶
level1, level2, ...: One or moreFloat64parameters (0 to 1, inclusive) specifying the quantile levels.expr: A numeric expression (Integer, Float, Decimal, Date, or DateTime).
Returns¶
An array of exact quantiles for the specified levels. Array(T).
Example¶
SELECT quantilesExactInclusive(0.25, 0.5, 0.75)(number) FROM numbers(10)
Result:
┌─quantilesExactInclusive(0.25, 0.5, 0.75)(number)─┐ │ [2.25,4.5,6.75] │ └──────────────────────────────────────────────────┘
quantilesGK¶
Computes multiple approximate quantiles of a numeric data sequence using the Greenwald-Khanna algorithm, allowing control over accuracy and returning an array of results.
Syntax¶
quantilesGK(accuracy, level1, level2, ...)(expr)
Arguments¶
accuracy: A positiveUInt64integer controlling the accuracy.level1, level2, ...: One or moreFloat64parameters (0 to 1) specifying the quantile levels.expr: A numeric expression (Integer, Float, Decimal, Date, or DateTime).
Returns¶
An array of approximate quantiles for the specified levels. Array(T).
Example¶
SELECT quantilesGK(100, 0.25, 0.5, 0.75)(number + 1) FROM numbers(1000)
Result:
┌─quantilesGK(100, 0.25, 0.5, 0.75)(plus(number, 1))─┐ │ [251,498,741] │ └────────────────────────────────────────────────────┘
quantileTDigest¶
Computes an approximate quantile of a numeric data sequence using the t-digest algorithm. This function is memory-efficient (log(n)) but non-deterministic.
Syntax¶
quantileTDigest(level)(expr)
Alias: medianTDigest
Arguments¶
level: An optionalFloat64parameter (0 to 1) specifying the quantile level. Default: 0.5.expr: A numeric expression (Integer, Float, Decimal, Date, or DateTime).
Returns¶
The approximate quantile of the specified level. Float64 for numeric input, Date for Date, DateTime for DateTime.
Example¶
SELECT quantileTDigest(number) FROM numbers(10)
Result:
┌─quantileTDigest(number)─┐ │ 4.5 │ └─────────────────────────┘
quantileTDigestWeighted¶
Computes an approximate quantile of a numeric data sequence using the t-digest algorithm, considering the weight of each element. It is memory-efficient (log(n)) but non-deterministic.
Syntax¶
quantileTDigestWeighted(level)(expr, weight)
Alias: medianTDigestWeighted
Arguments¶
level: An optionalFloat64parameter (0 to 1) specifying the quantile level. Default: 0.5.expr: A numeric expression (Integer, Float, Decimal, Date, or DateTime).weight: A numeric column representing the weight of eachexprvalue.
Returns¶
The approximate quantile of the specified level. Float64 for numeric input, Date for Date, DateTime for DateTime.
Example¶
SELECT quantileTDigestWeighted(number, 1) FROM numbers(10)
Result:
┌─quantileTDigestWeighted(number, 1)─┐ │ 4.5 │ └────────────────────────────────────┘
quantileTiming¶
Computes the quantile of a numeric data sequence with determined precision, optimized for time-related distributions (e.g., page loading times). The result is deterministic.
Syntax¶
quantileTiming(level)(expr)
Alias: medianTiming
Arguments¶
level: An optionalFloat64parameter (0 to 1) specifying the quantile level. Default: 0.5.expr: AFloat*expression representing time values. Negative values lead to undefined behavior. Values greater than 30,000 are treated as 30,000.
Returns¶
The quantile of the specified level as a Float32. Returns NaN if no values are passed.
Example¶
WITH data AS (SELECT arrayJoin([72, 112, 126, 145, 104, 242, 313, 168, 108]) AS response_time) SELECT quantileTiming(response_time) FROM data
Result:
┌─quantileTiming(response_time)─┐ │ 126 │ └───────────────────────────────┘
quantileTimingWeighted¶
Computes the quantile of a numeric data sequence with determined precision, considering the weight of each element, optimized for time-related distributions. The result is deterministic.
Syntax¶
quantileTimingWeighted(level)(expr, weight)
Alias: medianTimingWeighted
Arguments¶
level: An optionalFloat64parameter (0 to 1) specifying the quantile level. Default: 0.5.expr: AFloat*expression representing time values. Negative values lead to undefined behavior. Values greater than 30,000 are treated as 30,000.weight: A numeric column representing the weight of eachexprvalue.
Returns¶
The quantile of the specified level as a Float32. Returns NaN if no values are passed.
Example¶
WITH data AS (SELECT arrayJoin([(68,1), (104,2), (112,3), (126,2), (138,1), (162,1)]) AS pair) SELECT quantileTimingWeighted(pair.1, pair.2) FROM data
Result:
┌─quantileTimingWeighted(pair.1, pair.2)─┐ │ 112 │ └────────────────────────────────────────┘
quantilesTimingWeighted¶
Computes multiple quantiles of a numeric data sequence with determined precision, considering the weight of each element, optimized for time-related distributions. It returns an array of results for the specified levels.
Syntax¶
quantilesTimingWeighted(level1, level2, ...)(expr, weight)
Arguments¶
level1, level2, ...: One or moreFloat64parameters (0 to 1) specifying the quantile levels.expr: AFloat*expression representing time values.weight: A numeric column representing the weight of eachexprvalue.
Returns¶
An array of quantiles for the specified levels. Array(Float32).
Example¶
WITH data AS (SELECT arrayJoin([(68,1), (104,2), (112,3), (126,2), (138,1), (162,1)]) AS pair) SELECT quantilesTimingWeighted(0.5, 0.99)(pair.1, pair.2) FROM data
Result:
┌─quantilesTimingWeighted(0.5, 0.99)(pair.1, pair.2)─┐ │ [112,162] │ └────────────────────────────────────────────────────┘
rankCorr¶
Computes a rank correlation coefficient between two numeric variables, measuring the monotonic relationship between them.
Syntax¶
rankCorr(x, y)
Arguments¶
x: The first numeric variable (Float32orFloat64).y: The second numeric variable (Float32orFloat64).
Returns¶
A Float64 rank correlation coefficient between -1 and +1.
Example¶
SELECT rankCorr(number, number) FROM numbers(10)
Result:
┌─rankCorr(number, number)─┐ │ 1 │ └──────────────────────────┘
simpleLinearRegression¶
Performs a simple linear regression to find the best-fit line y = k*x + b for a set of (x, y) data points.
Syntax¶
simpleLinearRegression(x, y)
Arguments¶
x: The column of independent variable values.y: The column of dependent variable values.
Returns¶
A tuple containing the slope k and the y-intercept b of the regression line. Tuple(Float64, Float64).
Example¶
SELECT simpleLinearRegression(x, y)
FROM values('x Int64, y Int64', (0, 0), (1, 1), (2, 2), (3, 3))
Result:
┌─simpleLinearRegression(x, y)─┐ │ (1,0) │ └──────────────────────────────┘
singleValueOrNull¶
Checks if there is exactly one unique non-NULL value in a group. If so, it returns that value; otherwise, it returns NULL.
Syntax¶
singleValueOrNull(x)
Arguments¶
x: A column of any data type (excluding Map, Array, or Tuple).
Returns¶
The unique non-NULL value if exactly one exists, otherwise NULL. The return type matches the x type, but is Nullable.
Example¶
SELECT singleValueOrNull(col)
FROM values('col String', 'A', 'A', 'A')
Result:
┌─singleValueOrNull(col)─┐ │ A │ └────────────────────────┘
skewPop¶
Computes the population skewness of a sequence of numbers, which measures the asymmetry of the probability distribution.
Syntax¶
skewPop(expr)
Arguments¶
expr: A numeric expression.
Returns¶
The population skewness as a Float64.
Example¶
SELECT skewPop(number) FROM numbers(10)
Result:
┌─skewPop(number)─┐ │ 0 │ └─────────────────┘
skewSamp¶
Computes the sample skewness of a sequence of numbers, providing an unbiased estimate of the skewness of a random variable from a sample.
Syntax¶
skewSamp(expr)
Arguments¶
expr: A numeric expression.
Returns¶
The sample skewness as a Float64. Returns NaN if the sample size is 1 or less.
Example¶
SELECT skewSamp(number) FROM numbers(10)
Result:
┌─skewSamp(number)─┐ │ 0 │ └──────────────────┘
sparkbar¶
Generates a frequency histogram as a string visualization for values x and their frequencies y over a specified interval.
Syntax¶
sparkbar(buckets[, min_x, max_x])(x, y)
Parameters¶
buckets: TheUInt64number of segments (bars) in the histogram.min_x: Optional. The minimum value for the x-axis interval.max_x: Optional. The maximum value for the x-axis interval.
Arguments¶
x: The values to categorize.y: The frequency or count for eachxvalue.
Returns¶
A String representing the sparkbar histogram.
Example¶
SELECT sparkbar(5)(number, 1) FROM numbers(10)
Result:
┌─sparkbar(5)(number, 1)─┐ │ █▊▊▊▊ │ └────────────────────────┘
stddevPop¶
Calculates the population standard deviation, which is the square root of the population variance. This function uses a numerically unstable algorithm.
Syntax¶
stddevPop(x)
Aliases: STD, STDDEV_POP
Arguments¶
x: A numeric expression (Integer, Float, or Decimal).
Returns¶
The population standard deviation as a Float64.
Example¶
SELECT stddevPop(number) FROM numbers(10)
Result:
┌─stddevPop(number)─┐ │ 2.8722813232690143 │ └───────────────────┘
stddevPopStable¶
Calculates the population standard deviation using a numerically stable algorithm. This function is generally slower but provides more accurate results than stddevPop.
Syntax¶
stddevPopStable(x)
Alias: VAR_POP_STABLE
Arguments¶
x: A numeric expression (Integer, Float, or Decimal).
Returns¶
The population standard deviation as a Float64.
Example¶
SELECT stddevPopStable(number) FROM numbers(10)
Result:
┌─stddevPopStable(number)─┐ │ 2.8722813232690143 │ └─────────────────────────┘
stddevSamp¶
Calculates the sample standard deviation, which is the square root of the sample variance. This function uses a numerically unstable algorithm.
Syntax¶
stddevSamp(x)
Alias: STDDEV_SAMP
Arguments¶
x: A numeric expression (Integer, Float, or Decimal).
Returns¶
The sample standard deviation as a Float64.
Example¶
SELECT stddevSamp(number) FROM numbers(10)
Result:
┌─stddevSamp(number)─┐ │ 3.0276503540974917 │ └────────────────────┘
stddevSampStable¶
Calculates the sample standard deviation using a numerically stable algorithm. This function is generally slower but provides more accurate results than stddevSamp.
Syntax¶
stddevSampStable(x)
Alias: VAR_SAMP_STABLE
Arguments¶
x: A numeric expression (Integer, Float, or Decimal).
Returns¶
The sample standard deviation as a Float64.
Example¶
SELECT stddevSampStable(number) FROM numbers(10)
Result:
┌─stddevSampStable(number)─┐ │ 3.0276503540974917 │ └──────────────────────────┘
studentTTest¶
Applies Student's t-test to compare two samples and determine if their population means are equal, assuming normal distribution and equal variances.
Syntax¶
studentTTest([confidence_level])(sample_data, sample_index)
Parameters¶
confidence_level: An optionalFloat64confidence level for calculating confidence intervals.
Arguments¶
sample_data: The numeric values from both samples (Integer, Float, or Decimal).sample_index: AnInteger(0 or 1) indicating which sample thesample_databelongs to.
Returns¶
A tuple containing the calculated t-statistic and p-value. If confidence_level is specified, it also includes the lower and upper bounds of the confidence interval. Tuple(Float64, Float64) or Tuple(Float64, Float64, Float64, Float64).
Example¶
WITH student_ttest AS ( SELECT arrayJoin([(20.3,0), (21.1,0), (21.9,1), (21.7,0), (19.9,1), (21.8,1)]) AS data_pair ) SELECT studentTTest(data_pair.1, data_pair.2) FROM student_ttest
Result:
┌─studentTTest(data_pair.1, data_pair.2)────┐ │ (-0.21739130434783777,0.8385421208415731) │ └───────────────────────────────────────────┘
sum¶
Calculates the sum of numeric values within a group.
Syntax¶
sum(num)
Arguments¶
num: A numeric expression (Integer, Float, or Decimal).
Returns¶
The sum of the values. The return type matches the num type.
Example¶
SELECT sum(number) FROM numbers(10)
Result:
┌─sum(number)─┐ │ 45 │ └─────────────┘
sumCount¶
Calculates both the sum of numeric values and the count of non-NULL rows within a group.
Syntax¶
sumCount(x)
Arguments¶
x: A numeric expression (Integer, Float, or Decimal).
Returns¶
A tuple containing the sum of numbers and the count of non-NULL values. Tuple(T_sum, UInt64).
Example¶
SELECT sumCount(number) FROM numbers(10)
Result:
┌─sumCount(number)─┐ │ (45,10) │ └──────────────────┘
sumKahan¶
Calculates the sum of numeric values using the Kahan compensated summation algorithm, which provides higher precision for floating-point numbers than a simple sum.
Syntax¶
sumKahan(x)
Arguments¶
x: A numeric expression (Integer, Float, or Decimal).
Returns¶
The sum of the values, with a type matching the input x.
Example¶
SELECT sum(0.1), sumKahan(0.1) FROM numbers(10)
Result:
┌───────────sum(0.1)─┬─sumKahan(0.1)─┐ │ 0.9999999999999999 │ 1 │ └────────────────────┴───────────────┘
sumMap¶
Totals values for each key across multiple maps within a group. It returns a tuple of two arrays: sorted keys and their corresponding summed values, without overflow.
Syntax¶
sumMap(key, value) sumMap(Tuple(key, value))
Alias: sumMappedArrays
Arguments¶
key: An array of keys.value: An array of values. The number of elements inkeyandvaluearrays must be the same for each row.
Returns¶
A tuple containing two arrays: sorted keys and their summed values. Tuple(Array(T_key), Array(T_value)).
Example¶
SELECT sumMap(a, b)
FROM values('a Array(String), b Array(Int64)', (['x', 'y'], [2, 2]), (['y', 'z'], [3, 1]))
Result:
┌─sumMap(a, b)───────────┐ │ [['x','y','z'],[2,5,1]]│ └────────────────────────┘
sumMapWithOverflow¶
Totals values for each key across multiple maps within a group. It returns a tuple of two arrays: sorted keys and their corresponding summed values, allowing for overflow if the sum exceeds the data type's maximum.
Syntax¶
sumMapWithOverflow(key, value) sumMapWithOverflow(Tuple(key, value))
Arguments¶
key: An array of keys.value: An array of values. The number of elements inkeyandvaluearrays must be the same for each row.
Returns¶
A tuple containing two arrays: sorted keys and their summed values. Tuple(Array(T_key), Array(T_value)).
Example¶
SELECT sumMapWithOverflow(a, b)
FROM values('a Array(String), b Array(Int8)', (['x', 'y'], [100, 50]), (['y', 'z'], [80, 20]))
Result:
┌─sumMapWithOverflow(a, b)───────────┐ │ [['x','y','z'],[100,-126,20]] │ └────────────────────────────────────┘
sumWithOverflow¶
Calculates the sum of numeric values within a group, using the same data type for the result as the input. If the sum exceeds the maximum value for that data type, it will overflow.
Syntax¶
sumWithOverflow(num)
Arguments¶
num: A numeric expression (Integer, Float, or Decimal).
Returns¶
The sum of the values, with a type matching the input num.
Example¶
SELECT sumWithOverflow(toUInt8(number)) FROM numbers(200)
Result:
┌─sumWithOverflow(toUInt8(number))─┐ │ 104 │ └──────────────────────────────────┘
theilsU¶
Calculates Theil's U uncertainty coefficient, a measure of association between two categorical variables. The value ranges from -1.0 (perfect negative association) to +1.0 (perfect positive association), with 0.0 indicating no association.
Syntax¶
theilsU(column1, column2)
Arguments¶
column1: The first categorical column.column2: The second categorical column.
Returns¶
A Float64 value between -1 and 1.
Example¶
SELECT
theilsU(a ,b)
FROM
(
SELECT
number % 10 AS a,
number % 4 AS b
FROM
numbers(150)
)
Result:
┌────────theilsU(a, b)─┐ │ -0.30195720557678846 │ └──────────────────────┘
topK¶
Returns an array of the approximately most frequent values in a specified column, sorted by approximate frequency in descending order. It uses the Filtered Space-Saving algorithm.
Syntax¶
topK(N)(column) topK(N, load_factor)(column) topK(N, load_factor, 'counts')(column)
Parameters¶
N: The maximumUInt64number of elements to return. Optional. Default: 10.load_factor: AUInt64factor determining the number of cells reserved for values. Optional. Default: 3.'counts': AStringliteral to include approximate counts and error values in the result.
Arguments¶
column: The column whose value frequency is to be calculated.
Returns¶
An array of the most frequent values. If 'counts' is specified, it returns an array of tuples (value, approximate_count, error). Array(T) or Array(Tuple(T, UInt64, UInt64)).
Example¶
SELECT topK(3)(number % 5) FROM numbers(100)
Result:
┌─topK(3)(modulo(number, 5))─┐ │ [0,1,2] │ └────────────────────────────┘
topKWeighted¶
Returns an array of the approximately most frequent values in a specified column, considering a weight for each value. The result is sorted by approximate weighted frequency in descending order.
Syntax¶
topKWeighted(N)(column, weight) topKWeighted(N, load_factor)(column, weight) topKWeighted(N, load_factor, 'counts')(column, weight)
Parameters¶
N: The maximumUInt64number of elements to return. Optional. Default: 10.load_factor: AUInt64factor determining the number of cells reserved for values. Optional. Default: 3.'counts': AStringliteral to include approximate counts and error values in the result.
Arguments¶
column: The column whose value frequency is to be calculated.weight: AUInt64value representing the weight for eachcolumnentry.
Returns¶
An array of the values with the maximum approximate sum of weights. If 'counts' is specified, it returns an array of tuples (value, approximate_weighted_count, error). Array(T) or Array(Tuple(T, UInt64, UInt64)).
Example¶
SELECT topKWeighted(2)(k, w) FROM
VALUES('k String, w UInt64', ('y', 1), ('y', 1), ('x', 5), ('y', 1), ('z', 10))
Result:
┌─topKWeighted(2)(k, w)──┐ │ ['z','x'] │ └────────────────────────┘
uniq¶
Calculates the approximate number of distinct values for an argument using an adaptive sampling algorithm. This function is generally recommended for most scenarios due to its balance of accuracy and performance.
Syntax¶
uniq(x[, ...])
Arguments¶
x: One or more expressions or columns (Tuple, Array, Date, DateTime, String, or numeric types).
Returns¶
The approximate number of distinct values as a UInt64.
Example¶
SELECT uniq(number % 10) FROM numbers(100)
Result:
┌─uniq(modulo(number, 10))─┐ │ 10 │ └──────────────────────────┘
uniqCombined¶
Calculates the approximate number of distinct argument values using a combination of array, hash table, and HyperLogLog algorithms. It offers better memory efficiency and accuracy than uniq but can be slightly slower.
Syntax¶
uniqCombined(HLL_precision)(x[, ...])
Parameters¶
HLL_precision: An optionalUInt64(base-2 logarithm of HyperLogLog cells). Default: 17.
Arguments¶
x: One or more expressions or columns (Tuple, Array, Date, DateTime, String, or numeric types).
Returns¶
The approximate number of distinct values as a UInt64.
Example¶
SELECT uniqCombined(number) FROM numbers(1e6)
Result:
┌─uniqCombined(number)─┐ │ 1001148 │ └──────────────────────┘
uniqCombined64¶
Calculates the approximate number of distinct argument values, similar to uniqCombined, but uses a 64-bit hash for all data types. This provides higher accuracy for very large cardinalities, especially for non-string types.
Syntax¶
uniqCombined64(HLL_precision)(x[, ...])
Parameters¶
HLL_precision: An optionalUInt64(base-2 logarithm of HyperLogLog cells). Default: 17.
Arguments¶
x: One or more expressions or columns (Tuple, Array, Date, DateTime, String, or numeric types).
Returns¶
The approximate number of distinct values as a UInt64.
Example¶
SELECT uniqCombined64(number) FROM numbers(1e6)
Result:
┌─uniqCombined64(number)─┐ │ 1000000 │ └────────────────────────┘
uniqExact¶
Calculates the exact number of distinct argument values. This function consumes O(n) memory, where n is the number of distinct values, and should be used when absolute precision is required.
Syntax¶
uniqExact(x[, ...])
Arguments¶
x: One or more expressions or columns (Tuple, Array, Date, DateTime, String, or numeric types).
Returns¶
The exact number of distinct values as a UInt64.
Example¶
SELECT uniqExact(number % 10) FROM numbers(100)
Result:
┌─uniqExact(modulo(number, 10))─┐ │ 10 │ └───────────────────────────────┘
uniqHLL12¶
Calculates the approximate number of distinct argument values using the HyperLogLog algorithm with 2^12 5-bit cells. This function is generally not recommended for new development due to its lower accuracy compared to uniq or uniqCombined.
Syntax¶
uniqHLL12(x[, ...])
Arguments¶
x: One or more expressions or columns (Tuple, Array, Date, DateTime, String, or numeric types).
Returns¶
The approximate number of distinct values as a UInt64.
Example¶
SELECT uniqHLL12(number) FROM numbers(1000)
Result:
┌─uniqHLL12(number)─┐ │ 999 │ └───────────────────┘
uniqTheta¶
Calculates the approximate number of distinct argument values using the Theta Sketch Framework (KMV algorithm). It provides a relative error of 3.125% (95% confidence).
Syntax¶
uniqTheta(x[, ...])
Arguments¶
x: One or more expressions or columns (Tuple, Array, Date, DateTime, String, or numeric types).
Returns¶
The approximate number of distinct values as a UInt64.
Example¶
SELECT uniqTheta(number) FROM numbers(1000)
Result:
┌─uniqTheta(number)─┐ │ 1000 │ └───────────────────┘
varPop¶
Calculates the population variance of a set of numeric values. This function uses a numerically unstable algorithm.
Syntax¶
varPop(x)
Alias: VAR_POP
Arguments¶
x: A numeric expression (Integer, Float, or Decimal).
Returns¶
The population variance as a Float64.
Example¶
SELECT varPop(number) FROM numbers(10)
Result:
┌─varPop(number)─┐ │ 8.25 │ └────────────────┘
varPopStable¶
Calculates the population variance of a set of numeric values using a numerically stable algorithm. This function is generally slower but provides more accurate results than varPop.
Syntax¶
varPopStable(x)
Alias: VAR_POP_STABLE
Arguments¶
x: A numeric expression (Integer, Float, or Decimal).
Returns¶
The population variance as a Float64.
Example¶
SELECT varPopStable(number) FROM numbers(10)
Result:
┌─varPopStable(number)─┐ │ 8.25 │ └──────────────────────┘
varSamp¶
Calculates the sample variance of a set of numeric values. This function uses a numerically unstable algorithm.
Syntax¶
varSamp(x)
Alias: VAR_SAMP
Arguments¶
x: A numeric expression (Integer, Float, or Decimal).
Returns¶
The sample variance as a Float64.
Example¶
SELECT varSamp(number) FROM numbers(10)
Result:
┌─varSamp(number)─┐ │ 9.166666666666666 │ └─────────────────┘
varSampStable¶
Calculates the sample variance of a set of numeric values using a numerically stable algorithm. This function is generally slower but provides more accurate results than varSamp.
Syntax¶
varSampStable(x)
Alias: VAR_SAMP_STABLE
Arguments¶
x: A numeric expression (Integer, Float, or Decimal).
Returns¶
The sample variance as a Float64.
Example¶
SELECT varSampStable(number) FROM numbers(10)
Result:
┌─varSampStable(number)─┐ │ 9.166666666666666 │ └───────────────────────┘
welchTTest¶
Applies Welch's t-test to compare two samples and determine if their population means are equal, without assuming equal variances.
Syntax¶
welchTTest([confidence_level])(sample_data, sample_index)
Parameters¶
confidence_level: An optionalFloat64confidence level for calculating confidence intervals.
Arguments¶
sample_data: The numeric values from both samples (Integer, Float, or Decimal).sample_index: AnInteger(0 or 1) indicating which sample thesample_databelongs to.
Returns¶
A tuple containing the calculated t-statistic and p-value. If confidence_level is specified, it also includes the lower and upper bounds of the confidence interval. Tuple(Float64, Float64) or Tuple(Float64, Float64, Float64, Float64).
Example¶
WITH welch_ttest AS ( SELECT arrayJoin([(20.3,0), (22.1,0), (21.9,0), (18.9,1), (20.3,1), (19.0,1)]) AS data_pair ) SELECT welchTTest(data_pair.1, data_pair.2) FROM welch_ttest
Result:
┌─welchTTest(data_pair.1, data_pair.2)──────┐ │ (2.7988719532211235,0.051807360348581945) │ └───────────────────────────────────────────┘