Functions for searching in strings

All functions in this section search case-sensitively by default. Case-insensitive search is usually provided by separate function variants.

Functions in this section also assume that the searched string, referred to in this section as haystack, and the search string, referred to in this section as needle, are single-byte encoded text. If this assumption is violated, no exception is thrown and results are undefined. Search with UTF-8 encoded strings is usually provided by separate function variants. Likewise, if a UTF-8 function variant is used and the input strings aren't UTF-8 encoded text, no exception is thrown and the results are undefined. Note that no automatic Unicode normalization is performed, however you can use the normalizeUTF8*() functions for that.

General strings functions and functions for replacing in strings are described separately.

position

Finds the starting position of a substring within a string. The search is case-sensitive and positions are 1-based.

Syntax

position(haystack, needle[, start_pos])

Alias:

  • position(needle IN haystack)

Arguments

  • haystack: String in which the search is performed. String or Enum.
  • needle: Substring to be searched. String.
  • start_pos: Position (1-based) in haystack at which the search starts. UInt. Optional.

Returns

The 1-based starting position of the needle if found. Returns 0 if the needle is not found. UInt64.

If substring needle is empty, these rules apply:

  • if no start_pos was specified: return 1
  • if start_pos = 0: return 1
  • if start_pos >= 1 and start_pos <= length(haystack) + 1: return start_pos
  • otherwise: return 0

The same rules also apply to functions locate, positionCaseInsensitive, positionUTF8 and positionCaseInsensitiveUTF8.

Example

SELECT position('Hello, world!', '!')

Result:

┌─position('Hello, world!', '!')─┐
│                             13 │
└────────────────────────────────┘

Example with start_pos argument:

SELECT
    position('Hello, world!', 'o', 1),
    position('Hello, world!', 'o', 7)

Result:

┌─position('Hello, world!', 'o', 1)─┬─position('Hello, world!', 'o', 7)─┐
│                                 5 │                                 9 │
└───────────────────────────────────┴───────────────────────────────────┘

Example for needle IN haystack syntax:

SELECT 6 = position('/' IN s) FROM (SELECT 'Hello/World' AS s)

Result:

┌─equals(6, position(s, '/'))─┐
│                           1 │
└─────────────────────────────┘

Examples with empty needle substring:

SELECT
    position('abc', ''),
    position('abc', '', 0),
    position('abc', '', 1),
    position('abc', '', 2),
    position('abc', '', 3),
    position('abc', '', 4),
    position('abc', '', 5)

Result:

┌─position('abc', '')─┬─position('abc', '', 0)─┬─position('abc', '', 1)─┬─position('abc', '', 2)─┬─position('abc', '', 3)─┬─position('abc', '', 4)─┬─position('abc', '', 5)─┐
│                   1 │                      1 │                      1 │                      2 │                      3 │                      4 │                      0 │
└─────────────────────┴────────────────────────┴────────────────────────┴────────────────────────┴────────────────────────┴────────────────────────┴────────────────────────┘

locate

This function is similar to position, but its arguments are ordered differently, with the needle appearing before the haystack.

Syntax

locate(needle, haystack[, start_pos])

Arguments

  • needle: Substring to be searched. String.
  • haystack: String in which the search is performed. String or Enum.
  • start_pos: Position (1-based) in haystack at which the search starts. UInt. Optional.

Returns

The 1-based starting position of the needle if found. Returns 0 if the needle is not found. UInt64.

Example

SELECT locate('world', 'Hello, world!')

Result:

┌─locate('world', 'Hello, world!')─┐
│                                8 │
└──────────────────────────────────┘

positionCaseInsensitive

Finds the starting position of a substring within a string, ignoring case differences. Positions are 1-based.

Syntax

positionCaseInsensitive(haystack, needle[, start_pos])

Arguments

  • haystack: String in which the search is performed. String or Enum.
  • needle: Substring to be searched. String.
  • start_pos: Position (1-based) in haystack at which the search starts. UInt. Optional.

Returns

The 1-based starting position of the needle if found. Returns 0 if the needle is not found. UInt64.

Example

SELECT positionCaseInsensitive('Hello, world!', 'hello')

Result:

┌─positionCaseInsensitive('Hello, world!', 'hello')─┐
│                                                 1 │
└───────────────────────────────────────────────────┘

positionUTF8

Finds the starting position of a substring within a UTF-8 encoded string. This function correctly handles multi-byte UTF-8 characters.

Syntax

positionUTF8(haystack, needle[, start_pos])

Arguments

  • haystack: UTF-8 encoded string in which the search is performed. String or Enum.
  • needle: UTF-8 encoded substring to be searched. String.
  • start_pos: Position (1-based) in haystack at which the search starts. UInt. Optional.

Returns

The 1-based starting position of the needle if found. Returns 0 if the needle is not found. UInt64.

Example

Function positionUTF8 correctly counts character ö (represented by two points) as a single Unicode codepoint:

SELECT positionUTF8('Motörhead', 'r')

Result:

┌─position('Motörhead', 'r')─┐
│                          5 │
└────────────────────────────┘

positionCaseInsensitiveUTF8

Finds the starting position of a substring within a UTF-8 encoded string, ignoring case differences.

Syntax

positionCaseInsensitiveUTF8(haystack, needle[, start_pos])

Arguments

  • haystack: UTF-8 encoded string in which the search is performed. String or Enum.
  • needle: UTF-8 encoded substring to be searched. String.
  • start_pos: Position (1-based) in haystack at which the search starts. UInt. Optional.

Returns

The 1-based starting position of the needle if found. Returns 0 if the needle is not found. UInt64.

Example

SELECT positionCaseInsensitiveUTF8('Motörhead', 'HEAD')

Result:

┌─positionCaseInsensitiveUTF8('Motörhead', 'HEAD')─┐
│                                                6 │
└──────────────────────────────────────────────────┘

multiSearchAllPositions

Searches for multiple substrings within a string and returns an array containing the 1-based starting position for each needle.

All multiSearch*() functions only support up to 2<sup>8</sup> needles.

Syntax

multiSearchAllPositions(haystack, [needle1, needle2, ..., needleN])

Arguments

  • haystack: String in which the search is performed. String.
  • needle: An array of substrings to be searched. Array.

Returns

An array of UInt64 values. Each value represents the 1-based starting position of the corresponding needle in the haystack. A value of 0 indicates the needle was not found.

Example

SELECT multiSearchAllPositions('Hello, World!', ['hello', '!', 'world'])

Result:

┌─multiSearchAllPositions('Hello, World!', ['hello', '!', 'world'])─┐
│ [0,13,0]                                                          │
└───────────────────────────────────────────────────────────────────┘

multiSearchAllPositionsCaseInsensitive

Searches for multiple substrings within a string, ignoring case, and returns an array containing the 1-based starting position for each needle.

Syntax

multiSearchAllPositionsCaseInsensitive(haystack, [needle1, needle2, ..., needleN])

Arguments

  • haystack: String in which the search is performed. String.
  • needle: An array of substrings to be searched. Array.

Returns

An array of UInt64 values. Each value represents the 1-based starting position of the corresponding needle in the haystack. A value of 0 indicates the needle was not found.

Example

SELECT multiSearchAllPositionsCaseInsensitive('Tinybird',['t','b'])

Result:

[1,5]

multiSearchAllPositionsUTF8

Searches for multiple UTF-8 encoded substrings within a UTF-8 encoded string and returns an array containing the 1-based starting position for each needle.

Syntax

multiSearchAllPositionsUTF8(haystack, [needle1, needle2, ..., needleN])

Arguments

  • haystack: UTF-8 encoded string in which the search is performed. String.
  • needle: An array of UTF-8 encoded substrings to be searched. Array.

Returns

An array of UInt64 values. Each value represents the 1-based starting position of the corresponding needle in the haystack. A value of 0 indicates the needle was not found.

Example

Given Tinybird as a UTF-8 string, find the positions of T (\x54) and b (\x62).

SELECT multiSearchAllPositionsUTF8('\x54\x69\x6e\x79\x62\x69\x72\x64',['\x54','\x62'])

Result:

[1,5]

multiSearchAllPositionsCaseInsensitiveUTF8

Searches for multiple UTF-8 encoded substrings within a UTF-8 encoded string, ignoring case, and returns an array containing the 1-based starting position for each needle.

Syntax

multiSearchAllPositionsCaseInsensitiveUTF8(haystack, [needle1, needle2, ..., needleN])

Arguments

  • haystack: UTF-8 encoded string in which the search is performed. String.
  • needle: An array of UTF-8 encoded substrings to be searched. Array.

Returns

An array of UInt64 values. Each value represents the 1-based starting position of the corresponding needle in the haystack. A value of 0 indicates the needle was not found.

Example

Given Tinybird as a UTF-8 string, find the positions of t (\x74) and B (\x42).

SELECT multiSearchAllPositionsCaseInsensitiveUTF8('\x54\x69\x6e\x79\x62\x69\x72\x64',['\x74','\x42'])

Result:

[1,5]

multiSearchFirstPosition

Finds the leftmost starting position of any of the provided needle substrings within a haystack string.

Functions multiSearchFirstPositionCaseInsensitive, multiSearchFirstPositionUTF8 and multiSearchFirstPositionCaseInsensitiveUTF8 provide case-insensitive and/or UTF-8 variants of this function.

Syntax

multiSearchFirstPosition(haystack, [needle1, needle2, ..., needleN])

Arguments

  • haystack: String in which the search is performed. String.
  • needle: An array of substrings to be searched. Array.

Returns

The 1-based starting position of the first matching needle found. Returns 0 if none of the needle substrings are found. UInt64.

Example

SELECT multiSearchFirstPosition('Hello World',['llo', 'Wor', 'ld'])

Result:

3

multiSearchFirstPositionCaseInsensitive

Finds the leftmost starting position of any of the provided needle substrings within a haystack string, ignoring case.

Syntax

multiSearchFirstPositionCaseInsensitive(haystack, [needle1, needle2, ..., needleN])

Arguments

  • haystack: String in which the search is performed. String.
  • needle: An array of substrings to be searched. Array.

Returns

The 1-based starting position of the first matching needle found, ignoring case. Returns 0 if none of the needle substrings are found. UInt64.

Example

SELECT multiSearchFirstPositionCaseInsensitive('HELLO WORLD',['wor', 'ld', 'ello'])

Result:

2

multiSearchFirstPositionUTF8

Finds the leftmost starting position of any of the provided UTF-8 encoded needle substrings within a UTF-8 encoded haystack string.

Syntax

multiSearchFirstPositionUTF8(haystack, [needle1, needle2, ..., needleN])

Arguments

  • haystack: UTF-8 encoded string in which the search is performed. String.
  • needle: An array of UTF-8 encoded substrings to be searched. Array.

Returns

The 1-based starting position of the first matching needle found. Returns 0 if none of the needle substrings are found. UInt64.

Example

Find the leftmost offset in UTF-8 string hello world which matches any of the given needles.

SELECT multiSearchFirstPositionUTF8('\x68\x65\x6c\x6c\x6f\x20\x77\x6f\x72\x6c\x64',['wor', 'ld', 'ello'])

Result:

2

multiSearchFirstPositionCaseInsensitiveUTF8

Finds the leftmost starting position of any of the provided UTF-8 encoded needle substrings within a UTF-8 encoded haystack string, ignoring case.

Syntax

multiSearchFirstPositionCaseInsensitiveUTF8(haystack, [needle1, needle2, ..., needleN])

Arguments

  • haystack: UTF-8 encoded string in which the search is performed. String.
  • needle: An array of UTF-8 encoded substrings to be searched. Array.

Returns

The 1-based starting position of the first matching needle found, ignoring case. Returns 0 if none of the needle substrings are found. UInt64.

Example

Find the leftmost offset in UTF-8 string HELLO WORLD which matches any of the given needles.

SELECT multiSearchFirstPositionCaseInsensitiveUTF8('\x48\x45\x4c\x4c\x4f\x20\x57\x4f\x52\x4c\x44',['wor', 'ld', 'ello'])

Result:

2

multiSearchFirstIndex

Returns the 1-based index of the first needle in the provided array that is found earliest in the haystack string.

Functions multiSearchFirstIndexCaseInsensitive, multiSearchFirstIndexUTF8 and multiSearchFirstIndexCaseInsensitiveUTF8 provide case-insensitive and/or UTF-8 variants of this function.

Syntax

multiSearchFirstIndex(haystack, [needle1, needle2, ..., needleN])

Arguments

  • haystack: String in which the search is performed. String.
  • needle: An array of substrings to be searched. Array.

Returns

The 1-based index of the first needle from the array that matches. Returns 0 if no needle is found. UInt8.

Example

SELECT multiSearchFirstIndex('Hello World',['World','Hello'])

Result:

1

multiSearchFirstIndexCaseInsensitive

Returns the 1-based index of the first needle in the provided array that is found earliest in the haystack string, ignoring case.

Syntax

multiSearchFirstIndexCaseInsensitive(haystack, [needle1, needle2, ..., needleN])

Arguments

  • haystack: String in which the search is performed. String.
  • needle: An array of substrings to be searched. Array.

Returns

The 1-based index of the first needle from the array that matches, ignoring case. Returns 0 if no needle is found. UInt8.

Example

SELECT multiSearchFirstIndexCaseInsensitive('hElLo WoRlD',['World','Hello'])

Result:

1

multiSearchFirstIndexUTF8

Returns the 1-based index of the first UTF-8 encoded needle in the provided array that is found earliest in the UTF-8 encoded haystack string.

Syntax

multiSearchFirstIndexUTF8(haystack, [needle1, needle2, ..., needleN])

Arguments

  • haystack: UTF-8 encoded string in which the search is performed. String.
  • needle: An array of UTF-8 encoded substrings to be searched. Array.

Returns

The 1-based index of the first needle from the array that matches. Returns 0 if no needle is found. UInt8.

Example

Given Hello World as a UTF-8 string, find the first index of UTF-8 strings Hello and World.

SELECT multiSearchFirstIndexUTF8('\x48\x65\x6c\x6c\x6f\x20\x57\x6f\x72\x6c\x64',['\x57\x6f\x72\x6c\x64','\x48\x65\x6c\x6c\x6f'])

Result:

1

multiSearchFirstIndexCaseInsensitiveUTF8

Returns the 1-based index of the first UTF-8 encoded needle in the provided array that is found earliest in the UTF-8 encoded haystack string, ignoring case.

Syntax

multiSearchFirstIndexCaseInsensitiveUTF8(haystack, [needle1, needle2, ..., needleN])

Arguments

  • haystack: UTF-8 encoded string in which the search is performed. String.
  • needle: An array of UTF-8 encoded substrings to be searched. Array.

Returns

The 1-based index of the first needle from the array that matches, ignoring case. Returns 0 if no needle is found. UInt8.

Example

Given HELLO WORLD as a UTF-8 string, find the first index of UTF-8 strings hello and world.

SELECT multiSearchFirstIndexCaseInsensitiveUTF8('\x48\x45\x4c\x4c\x4f\x20\x57\x4f\x52\x4c\x44',['\x68\x65\x6c\x6c\x6f','\x77\x6f\x72\x6c\x64'])

Result:

1

multiSearchAny

Checks if any of the provided needle substrings exist within the haystack string.

Functions multiSearchAnyCaseInsensitive, multiSearchAnyUTF8 and multiSearchAnyCaseInsensitiveUTF8 provide case-insensitive and/or UTF-8 variants of this function.

Syntax

multiSearchAny(haystack, [needle1, needle2, ..., needleN])

Arguments

  • haystack: String in which the search is performed. String.
  • needle: An array of substrings to be searched. Array.

Returns

Returns 1 if at least one needle is found in the haystack, otherwise returns 0. UInt8.

Example

SELECT multiSearchAny('Tinybird',['C','H'])

Result:

0

multiSearchAnyCaseInsensitive

Checks if any of the provided needle substrings exist within the haystack string, ignoring case.

Syntax

multiSearchAnyCaseInsensitive(haystack, [needle1, needle2, ..., needleN])

Arguments

  • haystack: String in which the search is performed. String.
  • needle: An array of substrings to be searched. Array.

Returns

Returns 1 if at least one needle is found in the haystack (case-insensitive), otherwise returns 0. UInt8.

Example

SELECT multiSearchAnyCaseInsensitive('Tinybird',['t','b'])

Result:

1

multiSearchAnyUTF8

Checks if any of the provided UTF-8 encoded needle substrings exist within the UTF-8 encoded haystack string.

Syntax

multiSearchAnyUTF8(haystack, [needle1, needle2, ..., needleN])

Arguments

  • haystack: UTF-8 encoded string in which the search is performed. String.
  • needle: An array of UTF-8 encoded substrings to be searched. Array.

Returns

Returns 1 if at least one needle is found in the haystack, otherwise returns 0. UInt8.

Example

Given Tinybird as a UTF-8 string, check if there are any T (\x54) or b (\x62) letters in the word.

SELECT multiSearchAnyUTF8('\x54\x69\x6e\x79\x62\x69\x72\x64',['\x54','\x62'])

Result:

1

multiSearchAnyCaseInsensitiveUTF8

Checks if any of the provided UTF-8 encoded needle substrings exist within the UTF-8 encoded haystack string, ignoring case.

Syntax

multiSearchAnyCaseInsensitiveUTF8(haystack, [needle1, needle2, ..., needleN])

Arguments

  • haystack: UTF-8 encoded string in which the search is performed. String.
  • needle: An array of UTF-8 encoded substrings to be searched. Array.

Returns

Returns 1 if at least one needle is found in the haystack (case-insensitive), otherwise returns 0. UInt8.

Example

Given Tinybird as a UTF-8 string, check if there is any letter y(\x79) in the word, ignoring case.

SELECT multiSearchAnyCaseInsensitiveUTF8('\x54\x69\x6e\x79\x62\x69\x72\x64',['\x79'])

Result:

1

match

Determines if a string matches a given regular expression pattern.

Syntax

match(haystack, pattern)

Alias: haystack REGEXP pattern operator

Arguments

  • haystack: The string to search within. String.
  • pattern: The regular expression pattern to match, using re2 regular expression syntax. String.

Returns

Returns 1 if the haystack matches the pattern, otherwise returns 0. UInt8.

Example

SELECT match('Tinybird is great', '.*great')

Result:

┌─match('Tinybird is great', '.*great')─┐
│                                     1 │
└───────────────────────────────────────┘

multiMatchAny

Checks if a string matches any of the provided regular expression patterns.

Functions in the multi[Fuzzy]Match*() family use the Vectorscan library. As such, they are only enabled if ClickHouse is compiled with support for Vectorscan.

To turn off all functions that use Hyperscan, use setting SET allow_hyperscan = 0;.

Due to restrictions of Vectorscan, the length of the haystack string must be less than 2<sup>32</sup> bytes.

Hyperscan is generally vulnerable to regular expression denial of service (ReDoS) attacks. Users are advised to check the provided patterns carefully.

Syntax

multiMatchAny(haystack, [pattern1, pattern2, ..., patternN])

Arguments

  • haystack: The string to search within. String.
  • pattern: An array of regular expression patterns to match, using re2 regular expression syntax. Array.

Returns

Returns 1 if the haystack matches at least one of the patterns, otherwise returns 0. UInt8.

Example

SELECT multiMatchAny('Tinybird is fast', ['slow', 'fast'])

Result:

┌─multiMatchAny('Tinybird is fast', ['slow', 'fast'])─┐
│                                                   1 │
└─────────────────────────────────────────────────────┘

multiMatchAnyIndex

Returns the 1-based index of the first regular expression pattern in the provided array that matches the haystack string.

Syntax

multiMatchAnyIndex(haystack, [pattern1, pattern2, ..., patternN])

Arguments

  • haystack: The string to search within. String.
  • pattern: An array of regular expression patterns to match, using re2 regular expression syntax. Array.

Returns

The 1-based index of the first matching pattern from the array. Returns 0 if no pattern matches. UInt8.

Example

SELECT multiMatchAnyIndex('Tinybird is awesome', ['bad', 'awesome', 'slow'])

Result:

┌─multiMatchAnyIndex('Tinybird is awesome', ['bad', 'awesome', 'slow'])─┐
│                                                                     2 │
└───────────────────────────────────────────────────────────────────────┘

multiMatchAllIndices

Returns an array of all 1-based indices of the regular expression patterns that match the haystack string.

Syntax

multiMatchAllIndices(haystack, [pattern1, pattern2, ..., patternN])

Arguments

  • haystack: The string to search within. String.
  • pattern: An array of regular expression patterns to match, using re2 regular expression syntax. Array.

Returns

An array of UInt8 values, representing the 1-based indices of all matching patterns. The order of indices is not guaranteed. Array(UInt8).

Example

SELECT multiMatchAllIndices('Tinybird is fast and scalable', ['fast', 'slow', 'scalable'])

Result:

┌─multiMatchAllIndices('Tinybird is fast and scalable', ['fast', 'slow', 'scalable'])─┐
│ [1,3]                                                                               │
└─────────────────────────────────────────────────────────────────────────────────────┘

multiFuzzyMatchAny

Checks if a string matches any of the provided regular expression patterns within a specified edit distance.

multiFuzzyMatch*() function family do not support UTF-8 regular expressions (it treats them as a sequence of bytes) due to restrictions of Hyperscan.

Syntax

multiFuzzyMatchAny(haystack, distance, [pattern1, pattern2, ..., patternN])

Arguments

  • haystack: The string to search within. String.
  • distance: The maximum allowed edit distance for a match. UInt.
  • pattern: An array of regular expression patterns to match. Array.

Returns

Returns 1 if the haystack matches at least one pattern within the given distance, otherwise returns 0. UInt8.

Example

SELECT multiFuzzyMatchAny('Tinybird', 1, ['Tynybird', 'Tinybirdz'])

Result:

┌─multiFuzzyMatchAny('Tinybird', 1, ['Tynybird', 'Tinybirdz'])─┐
│                                                            1 │
└──────────────────────────────────────────────────────────────┘

multiFuzzyMatchAnyIndex

Returns the 1-based index of the first regular expression pattern in the provided array that matches the haystack string within a specified edit distance.

Syntax

multiFuzzyMatchAnyIndex(haystack, distance, [pattern1, pattern2, ..., patternN])

Arguments

  • haystack: The string to search within. String.
  • distance: The maximum allowed edit distance for a match. UInt.
  • pattern: An array of regular expression patterns to match. Array.

Returns

The 1-based index of the first matching pattern from the array within the given distance. Returns 0 if no pattern matches. UInt8.

Example

SELECT multiFuzzyMatchAnyIndex('Tinybird', 1, ['Tynybird', 'Tinybirdz'])

Result:

┌─multiFuzzyMatchAnyIndex('Tinybird', 1, ['Tynybird', 'Tinybirdz'])─┐
│                                                                 1 │
└───────────────────────────────────────────────────────────────────┘

multiFuzzyMatchAllIndices

Returns an array of all 1-based indices of the regular expression patterns that match the haystack string within a specified edit distance.

Syntax

multiFuzzyMatchAllIndices(haystack, distance, [pattern1, pattern2, ..., patternN])

Arguments

  • haystack: The string to search within. String.
  • distance: The maximum allowed edit distance for a match. UInt.
  • pattern: An array of regular expression patterns to match. Array.

Returns

An array of UInt8 values, representing the 1-based indices of all matching patterns within the given distance. The order of indices is not guaranteed. Array(UInt8).

Example

SELECT multiFuzzyMatchAllIndices('Tinybird', 1, ['Tynybird', 'Tinybirdz', 'Tinybird'])

Result:

┌─multiFuzzyMatchAllIndices('Tinybird', 1, ['Tynybird', 'Tinybirdz', 'Tinybird'])─┐
│ [1,2,3]                                                                         │
└─────────────────────────────────────────────────────────────────────────────────┘

extract

Extracts the first substring from haystack that matches the provided regular expression pattern.

Syntax

extract(haystack, pattern)

Arguments

  • haystack: The input string to search. String.
  • pattern: The regular expression pattern to match, using re2 regular expression syntax. String.

Returns

The first substring that matches the pattern. If the pattern contains capturing groups, it returns the match for the first capturing group. Returns an empty string if no match is found. String.

Example

SELECT extract('number: 1, number: 2, number: 3', '\\d+') AS result

Result:

┌─result─┐
│ 1      │
└────────┘

extractAll

Extracts all substrings from haystack that match the provided regular expression pattern.

Syntax

extractAll(haystack, pattern)

Arguments

  • haystack: The input string to search. String.
  • pattern: The regular expression pattern to match, using re2 regular expression syntax. String.

Returns

An array of all substrings that match the pattern. If the pattern contains capturing groups, it returns matches for the first capturing group. Returns an empty array if no matches are found. Array(String).

Example

SELECT extractAll('number: 1, number: 2, number: 3', '\\d+') AS result

Result:

┌─result────────┐
│ ['1','2','3'] │
└───────────────┘

extractAllGroupsHorizontal

Extracts all matches for each capturing group in a regular expression, returning an array of arrays where each inner array contains all matches for a specific group.

This function is slower than extractAllGroupsVertical.

Syntax

extractAllGroupsHorizontal(haystack, pattern)

Arguments

  • haystack: The input string to search. String.
  • pattern: The regular expression pattern with capturing groups, using re2 regular expression syntax. String.

Returns

An array of arrays of strings. The first inner array contains all matches for the first capturing group, the second for the second, and so on. Array(Array(String)).

If haystack doesn't match the pattern regex, an array of empty arrays is returned.

Example

SELECT extractAllGroupsHorizontal('abc=111, def=222, ghi=333', '("[^"]+"|\\w+)=("[^"]+"|\\w+)')

Result:

┌─extractAllGroupsHorizontal('abc=111, def=222, ghi=333', '("[^"]+"|\\w+)=("[^"]+"|\\w+)')─┐
│ [['abc','def','ghi'],['111','222','333']]                                                │
└──────────────────────────────────────────────────────────────────────────────────────────┘

extractGroups

Extracts the first set of matches for all capturing groups in a regular expression.

Syntax

extractGroups(haystack, pattern)

Arguments

  • haystack: The input string to search. String.
  • pattern: The regular expression pattern with capturing groups, using re2 regular expression syntax. String.

Returns

An array of strings, where each element corresponds to the first match of a capturing group. Array(String).

Example

SELECT extractGroups('hello abc=111 world', '("[^"]+"|\\w+)=("[^"]+"|\\w+)') AS result

Result:

┌─result────────┐
│ ['abc','111'] │
└───────────────┘

extractAllGroupsVertical

Extracts all matches for all capturing groups in a regular expression, returning an array of arrays where each inner array represents a complete match of all groups.

Syntax

extractAllGroupsVertical(haystack, pattern)

Arguments

  • haystack: The input string to search. String.
  • pattern: The regular expression pattern with capturing groups, using re2 regular expression syntax. String.

Returns

An array of arrays of strings. Each inner array contains the matches for all capturing groups from a single overall match of the pattern. Array(Array(String)).

If haystack doesn't match the pattern regex, an empty array is returned.

Example

SELECT extractAllGroupsVertical('abc=111, def=222, ghi=333', '("[^"]+"|\\w+)=("[^"]+"|\\w+)')

Result:

┌─extractAllGroupsVertical('abc=111, def=222, ghi=333', '("[^"]+"|\\w+)=("[^"]+"|\\w+)')─┐
│ [['abc','111'],['def','222'],['ghi','333']]                                            │
└────────────────────────────────────────────────────────────────────────────────────────┘

like

Checks if a string matches a pattern using SQL LIKE syntax, which supports % for any sequence of characters and _ for any single character.

Syntax

like(haystack, pattern)

Alias: haystack LIKE pattern (operator)

Arguments

  • haystack: The string to search within. String.
  • pattern: The LIKE pattern to match. String.

Returns

Returns 1 if the haystack matches the pattern, otherwise returns 0. UInt8.

Example

SELECT like('Tinybird', 'Tiny%bird')

Result:

┌─like('Tinybird', 'Tiny%bird')─┐
│                             1 │
└───────────────────────────────┘

notLike

Checks if a string does NOT match a pattern using SQL LIKE syntax. This is the inverse of the like function.

Syntax

notLike(haystack, pattern)

Alias: haystack NOT LIKE pattern (operator)

Arguments

  • haystack: The string to search within. String.
  • pattern: The LIKE pattern to match. String.

Returns

Returns 1 if the haystack does not match the pattern, otherwise returns 0. UInt8.

Example

SELECT notLike('Tinybird', 'Tiny%house')

Result:

┌─notLike('Tinybird', 'Tiny%house')─┐
│                                 1 │
└───────────────────────────────────┘

ilike

Checks if a string matches a pattern using SQL LIKE syntax, ignoring case differences.

Syntax

ilike(haystack, pattern)

Alias: haystack ILIKE pattern (operator)

Arguments

  • haystack: The string to search within. String.
  • pattern: The LIKE pattern to match. String.

Returns

Returns 1 if the haystack matches the pattern (case-insensitive), otherwise returns 0. UInt8.

Example

SELECT ilike('Tinybird', 'tiny%bird')

Result:

┌─ilike('Tinybird', 'tiny%bird')─┐
│                              1 │
└────────────────────────────────┘

notILike

Checks if a string does NOT match a pattern using SQL LIKE syntax, ignoring case differences. This is the inverse of the ilike function.

Syntax

notILike(haystack, pattern)

Alias: haystack NOT ILIKE pattern (operator)

Arguments

  • haystack: The string to search within. String.
  • pattern: The LIKE pattern to match. String.

Returns

Returns 1 if the haystack does not match the pattern (case-insensitive), otherwise returns 0. UInt8.

Example

SELECT notILike('Tinybird', 'tiny%house')

Result:

┌─notILike('Tinybird', 'tiny%house')─┐
│                                  1 │
└────────────────────────────────────┘

ngramDistance

Calculates the 4-gram distance between two strings, indicating their similarity. A value closer to 0 means higher similarity, while a value closer to 1 means lower similarity.

Functions ngramDistanceCaseInsensitive, ngramDistanceUTF8, ngramDistanceCaseInsensitiveUTF8 provide case-insensitive and/or UTF-8 variants of this function.

Syntax

ngramDistance(haystack, needle)

Arguments

  • haystack: The first string for comparison. String.
  • needle: The second string for comparison. String.

Returns

A Float32 value between 0 and 1, representing the normalized 4-gram distance. Float32.

Example

The more similar two strings are to each other, the closer the result will be to 0 (identical).

SELECT ngramDistance('Tinybird','Tinybird!')

Result:

0.06666667

The less similar two strings are to each, the larger the result will be.

SELECT ngramDistance('Tinybird','House')

Result:

0.5555556

ngramDistanceCaseInsensitive

Calculates the 4-gram distance between two strings, ignoring case, to indicate their similarity.

Syntax

ngramDistanceCaseInsensitive(haystack, needle)

Arguments

  • haystack: The first string for comparison. String.
  • needle: The second string for comparison. String.

Returns

A Float32 value between 0 and 1, representing the normalized 4-gram distance, ignoring case. Float32.

Example

With ngramDistance differences in case will affect the similarity value:

SELECT ngramDistance('Tinybird','tinybird')

Result:

0.71428573

With ngramDistanceCaseInsensitive case is ignored so two identical strings differing only in case will now return a low similarity value:

SELECT ngramDistanceCaseInsensitive('Tinybird','tinybird')

Result:

0

ngramDistanceUTF8

Calculates the 4-gram distance between two UTF-8 encoded strings, indicating their similarity.

Syntax

ngramDistanceUTF8(haystack, needle)

Arguments

  • haystack: The first UTF-8 encoded string for comparison. String.
  • needle: The second UTF-8 encoded string for comparison. String.

Returns

A Float32 value between 0 and 1, representing the normalized 4-gram distance for UTF-8 strings. Float32.

Example

SELECT ngramDistanceUTF8('abcde','cde')

Result:

0.5

ngramDistanceCaseInsensitiveUTF8

Calculates the 4-gram distance between two UTF-8 encoded strings, ignoring case, to indicate their similarity.

Syntax

ngramDistanceCaseInsensitiveUTF8(haystack, needle)

Arguments

  • haystack: The first UTF-8 encoded string for comparison. String.
  • needle: The second UTF-8 encoded string for comparison. String.

Returns

A Float32 value between 0 and 1, representing the normalized 4-gram distance for UTF-8 strings, ignoring case. Float32.

Example

SELECT ngramDistanceCaseInsensitiveUTF8('abcde','CDE')

Result:

0.5

ngramSearch

Calculates a non-symmetric n-gram similarity score between two strings, useful for fuzzy string searching. A higher value (closer to 1) suggests a higher likelihood of needle being contained within haystack.

Functions ngramSearchCaseInsensitive, ngramSearchUTF8, ngramSearchCaseInsensitiveUTF8 provide case-insensitive and/or UTF-8 variants of this function.

Syntax

ngramSearch(haystack, needle)

Arguments

  • haystack: The string to search within. String.
  • needle: The string to search for. String.

Returns

A Float32 value between 0 and 1, representing the likelihood of needle being in haystack. Float32.

Example

SELECT ngramSearch('Hello World','World Hello')

Result:

0.5

ngramSearchCaseInsensitive

Calculates a non-symmetric n-gram similarity score between two strings, ignoring case, for fuzzy string searching.

Syntax

ngramSearchCaseInsensitive(haystack, needle)

Arguments

  • haystack: The string to search within. String.
  • needle: The string to search for. String.

Returns

A Float32 value between 0 and 1, representing the likelihood of needle being in haystack, ignoring case. Float32.

Example

SELECT ngramSearchCaseInsensitive('Hello World','hello')

Result:

1

ngramSearchUTF8

Calculates a non-symmetric n-gram similarity score between two UTF-8 encoded strings for fuzzy string searching.

Syntax

ngramSearchUTF8(haystack, needle)

Arguments

  • haystack: The UTF-8 encoded string to search within. String.
  • needle: The UTF-8 encoded string to search for. String.

Returns

A Float32 value between 0 and 1, representing the likelihood of needle being in haystack (UTF-8). Float32.

Example

SELECT ngramSearchUTF8('абвгдеёжз', 'гдеёзд')

Result:

0.5

ngramSearchCaseInsensitiveUTF8

Calculates a non-symmetric n-gram similarity score between two UTF-8 encoded strings, ignoring case, for fuzzy string searching.

Syntax

ngramSearchCaseInsensitiveUTF8(haystack, needle)

Arguments

  • haystack: The UTF-8 encoded string to search within. String.
  • needle: The UTF-8 encoded string to search for. String.

Returns

A Float32 value between 0 and 1, representing the likelihood of needle being in haystack (UTF-8, case-insensitive). Float32.

Example

SELECT ngramSearchCaseInsensitiveUTF8('абвГДЕёжз', 'АбвгдЕЁжз')

Result:

0.57142854

countSubstrings

Counts the number of non-overlapping occurrences of a substring within a string.

Functions countSubstringsCaseInsensitive and countSubstringsCaseInsensitiveUTF8 provide case-insensitive and case-insensitive + UTF-8 variants of this function respectively.

Syntax

countSubstrings(haystack, needle[, start_pos])

Arguments

  • haystack: String in which the search is performed. String or Enum.
  • needle: Substring to be counted. String.
  • start_pos: Position (1-based) in haystack at which the search starts. UInt. Optional.

Returns

The total number of times the needle substring appears in the haystack. UInt64.

Example

SELECT countSubstrings('aaaa', 'aa')

Result:

┌─countSubstrings('aaaa', 'aa')─┐
│                             2 │
└───────────────────────────────┘

Example with start_pos argument:

SELECT countSubstrings('abc___abc', 'abc', 4)

Result:

┌─countSubstrings('abc___abc', 'abc', 4)─┐
│                                      1 │
└────────────────────────────────────────┘

countSubstringsCaseInsensitive

Counts the number of non-overlapping occurrences of a substring within a string, ignoring case.

Syntax

countSubstringsCaseInsensitive(haystack, needle[, start_pos])

Arguments

  • haystack: String in which the search is performed. String or Enum.
  • needle: Substring to be counted. String.
  • start_pos: Position (1-based) in haystack at which the search starts. UInt. Optional.

Returns

The total number of times the needle substring appears in the haystack, ignoring case. UInt64.

Example

SELECT countSubstringsCaseInsensitive('AAAA', 'aa')

Result:

┌─countSubstringsCaseInsensitive('AAAA', 'aa')─┐
│                                            2 │
└──────────────────────────────────────────────┘

Example with start_pos argument:

SELECT countSubstringsCaseInsensitive('abc___ABC___abc', 'abc', 4)

Result:

┌─countSubstringsCaseInsensitive('abc___ABC___abc', 'abc', 4)─┐
│                                                           2 │
└─────────────────────────────────────────────────────────────┘

countSubstringsCaseInsensitiveUTF8

Counts the number of non-overlapping occurrences of a UTF-8 encoded substring within a UTF-8 encoded string, ignoring case.

Syntax

countSubstringsCaseInsensitiveUTF8(haystack, needle[, start_pos])

Arguments

  • haystack: UTF-8 encoded string in which the search is performed. String or Enum.
  • needle: UTF-8 encoded substring to be counted. String.
  • start_pos: Position (1-based) in haystack at which the search starts. UInt. Optional.

Returns

The total number of times the needle substring appears in the haystack (UTF-8, case-insensitive). UInt64.

Example

SELECT countSubstringsCaseInsensitiveUTF8('ложка, кошка, картошка', 'КА')

Result:

┌─countSubstringsCaseInsensitiveUTF8('ложка, кошка, картошка', 'КА')─┐
│                                                                  4 │
└────────────────────────────────────────────────────────────────────┘

Example with start_pos argument:

SELECT countSubstringsCaseInsensitiveUTF8('ложка, кошка, картошка', 'КА', 13)

Result:

┌─countSubstringsCaseInsensitiveUTF8('ложка, кошка, картошка', 'КА', 13)─┐
│                                                                      2 │
└────────────────────────────────────────────────────────────────────────┘

countMatches

Counts the number of times a regular expression pattern matches within a string.

Syntax

countMatches(haystack, pattern)

Arguments

  • haystack: The string to search within. String.
  • pattern: The regular expression pattern to match, using re2 regular expression syntax. String.

Returns

The total number of matches found for the pattern in the haystack. UInt64.

Example

SELECT countMatches('foobar.com', 'o+')

Result:

┌─countMatches('foobar.com', 'o+')─┐
│                                2 │
└──────────────────────────────────┘
SELECT countMatches('aaaa', 'aa')

Result:

┌─countMatches('aaaa', 'aa')────┐
│                             2 │
└───────────────────────────────┘

countMatchesCaseInsensitive

Counts the number of times a regular expression pattern matches within a string, ignoring case.

Syntax

countMatchesCaseInsensitive(haystack, pattern)

Arguments

  • haystack: The string to search within. String.
  • pattern: The regular expression pattern to match, using re2 regular expression syntax. String.

Returns

The total number of matches found for the pattern in the haystack, ignoring case. UInt64.

Example

SELECT countMatchesCaseInsensitive('AAAA', 'aa')

Result:

┌─countMatchesCaseInsensitive('AAAA', 'aa')────┐
│                                            2 │
└──────────────────────────────────────────────┘

regexpExtract

Extracts a specific capturing group from the first match of a regular expression in a string.

Syntax

regexpExtract(haystack, pattern[, index])

Alias: REGEXP_EXTRACT(haystack, pattern[, index]).

Arguments

  • haystack: The string to search within. String.
  • pattern: The regular expression pattern with capturing groups, using re2 regular expression syntax. String.
  • index: An integer (0-based) specifying which capturing group to extract. 0 extracts the entire match. Defaults to 1. UInt or Int. Optional.

Returns

The extracted substring corresponding to the specified capturing group. String.

Example

SELECT
    regexpExtract('100-200', '(\\d+)-(\\d+)', 1),
    regexpExtract('100-200', '(\\d+)-(\\d+)', 2),
    regexpExtract('100-200', '(\\d+)-(\\d+)', 0),
    regexpExtract('100-200', '(\\d+)-(\\d+)')

Result:

┌─regexpExtract('100-200', '(\\d+)-(\\d+)', 1)─┬─regexpExtract('100-200', '(\\d+)-(\\d+)', 2)─┬─regexpExtract('100-200', '(\\d+)-(\\d+)', 0)─┬─regexpExtract('100-200', '(\\d+)-(\\d+)')─┐
│ 100                                          │ 200                                          │ 100-200                                      │ 100                                       │
└──────────────────────────────────────────────┴──────────────────────────────────────────────┴──────────────────────────────────────────────┴───────────────────────────────────────────┘

hasSubsequence

Checks if one string is a subsequence of another. A subsequence can be derived by deleting zero or more characters from the original string without changing the order of the remaining characters.

Syntax

hasSubsequence(haystack, needle)

Arguments

  • haystack: The string to search within. String.
  • needle: The subsequence to search for. String.

Returns

Returns 1 if needle is a subsequence of haystack, otherwise returns 0. UInt8.

Example

SELECT hasSubsequence('garbage', 'arg')

Result:

┌─hasSubsequence('garbage', 'arg')─┐
│                                1 │
└──────────────────────────────────┘

hasSubsequenceCaseInsensitive

Checks if one string is a subsequence of another, ignoring case differences.

Syntax

hasSubsequenceCaseInsensitive(haystack, needle)

Arguments

  • haystack: The string to search within. String.
  • needle: The subsequence to search for. String.

Returns

Returns 1 if needle is a subsequence of haystack (case-insensitive), otherwise returns 0. UInt8.

Example

SELECT hasSubsequenceCaseInsensitive('garbage', 'ARG')

Result:

┌─hasSubsequenceCaseInsensitive('garbage', 'ARG')─┐
│                                               1 │
└─────────────────────────────────────────────────┘

hasSubsequenceUTF8

Checks if one UTF-8 encoded string is a subsequence of another UTF-8 encoded string.

Syntax

hasSubsequenceUTF8(haystack, needle)

Arguments

  • haystack: The UTF-8 encoded string to search within. String.
  • needle: The UTF-8 encoded subsequence to search for. String.

Returns

Returns 1 if needle is a subsequence of haystack (UTF-8), otherwise returns 0. UInt8.

Example

SELECT hasSubsequenceUTF8('ClickHouse - столбцовая система управления базами данных', 'система')

Result:

┌─hasSubsequenceUTF8('ClickHouse - столбцовая система управления базами данных', 'система')─┐
│                                                                                         1 │
└───────────────────────────────────────────────────────────────────────────────────────────┘

hasSubsequenceCaseInsensitiveUTF8

Checks if one UTF-8 encoded string is a subsequence of another UTF-8 encoded string, ignoring case.

Syntax

hasSubsequenceCaseInsensitiveUTF8(haystack, needle)

Arguments

  • haystack: The UTF-8 encoded string to search within. String.
  • needle: The UTF-8 encoded subsequence to search for. String.

Returns

Returns 1 if needle is a subsequence of haystack (UTF-8, case-insensitive), otherwise returns 0. UInt8.

Example

SELECT hasSubsequenceCaseInsensitiveUTF8('ClickHouse - столбцовая система управления базами данных', 'СИСТЕМА')

Result:

┌─hasSubsequenceCaseInsensitiveUTF8('ClickHouse - столбцовая система управления базами данных', 'СИСТЕМА')─┐
│                                                                                                        1 │
└──────────────────────────────────────────────────────────────────────────────────────────────────────────┘

hasToken

Checks if a specific token (a maximal length substring between non-alphanumeric ASCII characters or string boundaries) is present in a string.

Syntax

hasToken(haystack, token)

Arguments

  • haystack: String in which the search is performed. String or Enum.
  • token: The token to search for. String.

Returns

Returns 1 if the token is found in the haystack, otherwise returns 0. UInt8.

Example

SELECT hasToken('Hello World','Hello')

Result:

1

hasTokenOrNull

Checks if a specific token is present in a string, returning 1 if found, 0 if not found, or null if the token is ill-formed.

Syntax

hasTokenOrNull(haystack, token)

Arguments

  • haystack: String in which the search is performed. String or Enum.
  • token: The token to search for. String.

Returns

Returns 1 if the token is present, 0 if not present, or null if the token is ill-formed. Nullable(UInt8).

Example

Where hasToken would throw an error for an ill-formed token, hasTokenOrNull returns null for an ill-formed token.

SELECT hasTokenOrNull('Hello World','Hello,World')

Result:

null

hasTokenCaseInsensitive

Checks if a specific token is present in a string, ignoring case differences.

Syntax

hasTokenCaseInsensitive(haystack, token)

Arguments

  • haystack: String in which the search is performed. String or Enum.
  • token: The token to search for. String.

Returns

Returns 1 if the token is found in the haystack (case-insensitive), otherwise returns 0. UInt8.

Example

SELECT hasTokenCaseInsensitive('Hello World','hello')

Result:

1

hasTokenCaseInsensitiveOrNull

Checks if a specific token is present in a string, ignoring case, returning 1 if found, 0 if not found, or null if the token is ill-formed.

Syntax

hasTokenCaseInsensitiveOrNull(haystack, token)

Arguments

  • haystack: String in which the search is performed. String or Enum.
  • token: The token to search for. String.

Returns

Returns 1 if the token is present (case-insensitive), 0 if not present, or null if the token is ill-formed. Nullable(UInt8).

Example

Where hasTokenCaseInsensitive would throw an error for an ill-formed token, hasTokenCaseInsensitiveOrNull returns null for an ill-formed token.

SELECT hasTokenCaseInsensitiveOrNull('Hello World','hello,world')

Result:

null
Updated