Functions for splitting strings¶

The following functions are used to split strings into substrings.

splitByChar¶

Splits a string into an array of substrings using a single specified character as the delimiter. Empty substrings are included if the separator appears at the beginning, end, or multiple times consecutively.

Syntax¶

splitByChar(separator, s[, max_substrings])

Arguments¶

separator: String. The single character used to divide the string.
s: String. The input string to be split.
max_substrings: Int64. An optional parameter. If greater than 0, the function returns at most this many substrings. Otherwise, it returns all possible substrings.

Returns¶

An array of strings, representing the split substrings. Array(String).

Example¶

SELECT splitByChar(',', '1,2,3,abcde')

Result:

┌─splitByChar(',', '1,2,3,abcde')─┐
│ ['1','2','3','abcde']           │
└─────────────────────────────────┘

splitByString¶

Divides a string into an array of substrings using a specified string as the delimiter. If the separator string is empty, the function splits the input string into individual characters.

Syntax¶

splitByString(separator, s[, max_substrings])

Arguments¶

separator: String. The string used to divide the input string.
s: String. The input string to be split.
max_substrings: Int64. An optional parameter. If greater than 0, the function returns at most this many substrings. Otherwise, it returns all possible substrings.

Returns¶

An array of strings, representing the split substrings. Array(String).

Setting splitby_max_substrings_includes_remaining_string (default: 0) controls if the remaining string is included in the last element of the result array when argument max_substrings > 0.

Example¶

SELECT splitByString(', ', '1, 2 3, 4,5, abcde')

Result:

┌─splitByString(', ', '1, 2 3, 4,5, abcde')─┐
│ ['1','2 3','4,5','abcde']                 │
└───────────────────────────────────────────┘

SELECT splitByString('', 'abcde')

Result:

┌─splitByString('', 'abcde')─┐
│ ['a','b','c','d','e']      │
└────────────────────────────┘

splitByRegexp¶

Splits a string into an array of substrings based on matches to a regular expression. If the regular expression is empty, the string is split into individual characters. If no match is found, the string is not split.

Syntax¶

splitByRegexp(regexp, s[, max_substrings])

Arguments¶

regexp: String or FixedString. The regular expression used as the delimiter. Must be a constant.
s: String. The input string to be split.
max_substrings: Int64. An optional parameter. If greater than 0, the function returns at most this many substrings. Otherwise, it returns all possible substrings.

Returns¶

An array of strings, representing the split substrings. Array(String).

Setting splitby_max_substrings_includes_remaining_string (default: 0) controls if the remaining string is included in the last element of the result array when argument max_substrings > 0.

Example¶

SELECT splitByRegexp('\\d+', 'a12bc23de345f')

Result:

┌─splitByRegexp('\\d+', 'a12bc23de345f')─┐
│ ['a','bc','de','f']                    │
└────────────────────────────────────────┘

SELECT splitByRegexp('', 'abcde')

Result:

┌─splitByRegexp('', 'abcde')─┐
│ ['a','b','c','d','e']      │
└────────────────────────────┘

splitByWhitespace¶

Splits a string into an array of substrings, using any sequence of whitespace characters as the delimiter.

Syntax¶

splitByWhitespace(s[, max_substrings])

Arguments¶

s: String. The input string to be split.
max_substrings: Int64. An optional parameter. If greater than 0, the function returns at most this many substrings. Otherwise, it returns all possible substrings.

Returns¶

An array of strings, representing the split substrings. Array(String).

Setting splitby_max_substrings_includes_remaining_string (default: 0) controls if the remaining string is included in the last element of the result array when argument max_substrings > 0.

Example¶

SELECT splitByWhitespace('  1!  a,  b.  ')

Result:

┌─splitByWhitespace('  1!  a,  b.  ')─┐
│ ['1!','a,','b.']                    │
└─────────────────────────────────────┘

splitByNonAlpha¶

Splits a string into an array of substrings, using any non-alphabetic characters (whitespace or punctuation) as delimiters.

Syntax¶

splitByNonAlpha(s[, max_substrings])

Arguments¶

s: String. The input string to be split.
max_substrings: Int64. An optional parameter. If greater than 0, the function returns at most this many substrings. Otherwise, it returns all possible substrings.

Returns¶

An array of strings, representing the split substrings. Array(String).

Setting splitby_max_substrings_includes_remaining_string (default: 0) controls if the remaining string is included in the last element of the result array when argument max_substrings > 0.

Example¶

SELECT splitByNonAlpha('  1!  a,  b.  ')

Result:

┌─splitByNonAlpha('  1!  a,  b.  ')─┐
│ ['1','a','b']                     │
└───────────────────────────────────┘

arrayStringConcat¶

Concatenates all elements of an array into a single string, optionally separated by a specified delimiter.

Syntax¶

arrayStringConcat(arr[, separator])

Arguments¶

arr: Array. The array of values to concatenate.
separator: String. An optional string to insert between array elements. Defaults to an empty string if not provided.

Returns¶

A single string resulting from the concatenation. String.

Example¶

SELECT arrayStringConcat(['12/05/2021', '12:50:00'], ' ') AS DateString

Result:

┌─DateString──────────┐
│ 12/05/2021 12:50:00 │
└─────────────────────┘

alphaTokens¶

Extracts sequences of consecutive alphabetic characters (a-z, A-Z) from a string into an array of substrings.

Syntax¶

alphaTokens(s[, max_substrings])

Alias: splitByAlpha

Arguments¶

s: String. The input string to extract tokens from.
max_substrings: Int64. An optional parameter. If greater than 0, the function returns at most this many substrings. Otherwise, it returns all possible substrings.

Returns¶

An array of strings, containing only the alphabetic tokens. Array(String).

Setting splitby_max_substrings_includes_remaining_string (default: 0) controls if the remaining string is included in the last element of the result array when argument max_substrings > 0.

Example¶

SELECT alphaTokens('abca1abc')

Result:

┌─alphaTokens('abca1abc')─┐
│ ['abca','abc']          │
└─────────────────────────┘

extractAllGroups¶

Extracts all non-overlapping matches of a regular expression from a string, returning each capturing group as a separate element within nested arrays.

Syntax¶

extractAllGroups(text, regexp)

Arguments¶

text: String or FixedString. The input string to search within.
regexp: String or FixedString. The regular expression to match. Must be a constant.

Returns¶

An array of arrays of strings. Each inner array contains the captured groups for a single match. If no matches are found, an empty array is returned. Array(Array(String)).

Example¶

SELECT extractAllGroups('abc=123, 8="hkl"', '("[^"]+"|\\w+)=("[^"]+"|\\w+)')

Result:

┌─extractAllGroups('abc=123, 8="hkl"', '("[^"]+"|\\w+)=("[^"]+"|\\w+)')─┐
│ [['abc','123'],['8','"hkl"']]                                         │
└───────────────────────────────────────────────────────────────────────┘

ngrams¶

Splits a UTF-8 encoded string into an array of n-grams, where each n-gram is a substring of a specified length.

Syntax¶

ngrams(string, ngramsize)

Arguments¶

string: String or FixedString. The input string to generate n-grams from.
ngramsize: UInt. The desired length of each n-gram.

Returns¶

An array of strings, where each string is an n-gram. Array(String).

Example¶

SELECT ngrams('Tinybird', 3)

Result:

┌─ngrams('Tinybird', 3)─────────────────┐
│ ['Tin','iny','nyb','ybi','bir','ird'] │
└───────────────────────────────────────┘

tokens¶

Splits a string into an array of tokens, using any non-alphanumeric ASCII characters as delimiters.

Syntax¶

tokens(input_string)

Arguments¶

input_string: String. The input string to tokenize.

Returns¶

An array of strings, representing the extracted tokens. Array.

Example¶

SELECT tokens('test1,;\\ test2,;\\ test3,;\\   test4') AS tokens

Result:

┌─tokens────────────────────────────┐
│ ['test1','test2','test3','test4'] │
└───────────────────────────────────┘