A trigram index is a type of inverted index created on a STRING
column. Trigram indexes are used to efficiently search for strings in large tables without providing an exact search term.
This page describes how to create and use trigram indexes on CockroachDB.
Some PostgreSQL syntax and features are currently unsupported. For details, see Unsupported features.
How do trigram indexes work?
Trigram indexes make substring and similarity matches efficient by indexing the unique trigrams of a string. A trigram is a group of three consecutive characters in a string.
To display the trigrams within a string, use the show_trgm()
built-in function:
SELECT show_trgm('word');
show_trgm
-------------------------------
{" w"," wo",ord,"rd ",wor}
A trigram index stores every unique trigram within each string being indexed. When you search a trigram index for a value, the database retrieves all of the entries in the index that match enough of the trigrams of the search value to satisfy the match. The type of match depends on the comparison operator:
- Exact for an equality (
=
) or pattern matching (LIKE
/ILIKE
) search. - Inexact for a similarity (
%
) search.
Trigrams enable pattern matching even when the prefix of the string is not known. For example:
SELECT * FROM t WHERE text_col LIKE '%foobar%';
Fuzzy string matching based on text similarity is useful when:
- The spelling of a search term is not exact.
- The exact search term is not known.
For example, if you don't know how to spell a name in your database, you can use a %
comparison to perform a fuzzy search. When applied to a STRING
column, the %
operator matches values that meet a configured similarity threshold.
To search for names like "Steven" in column first_name
:
SELECT first_name FROM users WHERE first_name % 'steven';
first_name
--------------
Stephen
Steve
Seven
Trigram matching only works if the search term has at least 3 characters.
Fuzzy string matching, as well as LIKE
and ILIKE
pattern matching, can be very slow on large datasets. To make both types of searches more efficient, create a trigram index. For an example, see Use a trigram index to speed up fuzzy string matching.
Creation
To create a trigram index, use the CREATE INDEX
syntax that defines an inverted index, specifying a STRING
column and the gin_trgm_ops
or gist_trgm_ops
opclass. It is necessary to specify an opclass in order to enable trigram-based comparisons.
Using the PostgreSQL-compatible syntax:
CREATE INDEX {optional name} ON {table} USING GIN({column} gin_trgm_ops);
CREATE INDEX {optional name} ON {table} USING GIST({column} gist_trgm_ops);
Note:GIN and GiST indexes are implemented identically on CockroachDB.
GIN
andGIST
are therefore synonymous when defining a trigram index.Using
CREATE INVERTED INDEX
:CREATE INVERTED INDEX {optional name} ON {table} ({column} {opclass});
Comparisons
Trigram indexes on STRING
columns support the following comparison operators:
- equality:
=
. Note that standardbtree
secondary indexes may perform better than trigram indexes for equality searches. - pattern matching (case-sensitive):
LIKE
- pattern matching (case-insensitive):
ILIKE
- similarity matching:
%
. This operator returnstrue
if the strings in the comparison have a similarity that meets or exceeds the threshold set by thepg_trgm.similarity_threshold
session variable.
For usage examples, see Use a trigram index to speed up fuzzy string matching.
Examples
Create various trigram indexes
Suppose you have a table with the following columns:
CREATE TABLE t (a INT, w STRING);
The following examples illustrate how to create various trigram indexes on column w
.
A GIN index with trigram matching enabled:
CREATE INDEX ON t USING GIN (w gin_trgm_ops);
A partial index with trigram matching enabled:
CREATE INDEX ON t USING GIN (w gin_trgm_ops) WHERE a > 0;
A multi-column index with trigram matching enabled:
CREATE INDEX ON t USING GIN (a, w gin_trgm_ops);
An expression index with trigram matching enabled:
CREATE INDEX ON t USING GIN ((json_col->>'json_text_field'))
Use a trigram index to speed up fuzzy string matching
Create a table with a
STRING
column:CREATE TABLE t (w STRING);
Populate the table with sample values:
INSERT INTO t VALUES ('foo'), ('bar'), ('wordy'), ('world'), ('whorl'), ('wort'), ('worm'), ('norm'), ('weird'), ('worried'), ('wofoord'), ('wobarrd'); INSERT INTO t SELECT 'empty' FROM generate_series(1, 10000);
See how trigram matching performs without a trigram index. Retrieve the columns with values similar to
word
, using the%
operator. Sort the results by the output of thesimilarity()
built-in function:SELECT w, similarity(w, 'word') FROM t WHERE w % 'word' ORDER BY similarity DESC, w;
w | similarity ----------+---------------------- wordy | 0.5714285714285714 wofoord | 0.4444444444444444 worm | 0.42857142857142855 wort | 0.42857142857142855 world | 0.375 wobarrd | 0.3 worried | 0.3 (7 rows) Time: 30ms total (execution 30ms / network 0ms)
Values are not included in the results if their similarities do not meet the threshold set by
pg_trgm.similarity_threshold
, which defaults to0.3
. For example:SELECT similarity('weird', 'word');
similarity ---------------------- 0.2222222222222222
Notice that the fuzzy search took 30 milliseconds to execute. Without a trigram index, the statement performs a full scan, which you can verify using
EXPLAIN
:EXPLAIN SELECT w, similarity(w, 'word') FROM t WHERE w % 'word' ORDER BY similarity DESC, w;
info ------------------------------------------------------------------------------------------------- distribution: local vectorized: true • sort │ estimated row count: 3,337 │ order: -similarity,+w │ └── • render │ └── • filter │ estimated row count: 3,337 │ filter: w % 'word' │ └── • scan estimated row count: 10,012 (100% of the table; stats collected 59 minutes ago) table: t@t_pkey spans: FULL SCAN
To speed up the fuzzy search, create a trigram index on column
w
:CREATE INDEX ON t USING GIN (w gin_trgm_ops);
Check that the statement uses the trigram index:
EXPLAIN SELECT w, similarity(w, 'word') FROM t WHERE w % 'word' ORDER BY similarity DESC, w;
info ------------------------------------------------------------------------------------------------------ distribution: local vectorized: true • sort │ estimated row count: 3,337 │ order: -similarity,+w │ └── • render │ └── • filter │ estimated row count: 3,337 │ filter: w % 'word' │ └── • index join │ estimated row count: 1,112 │ table: t@t_pkey │ └── • inverted filter │ estimated row count: 1,112 │ inverted column: w_inverted_key │ num spans: 2 │ └── • scan estimated row count: 1,112 (11% of the table; stats collected 4 minutes ago) table: t@t_w_idx spans: 2 spans
Execute the statement again and note the improved performance:
SELECT w, similarity(w, 'word') FROM t WHERE w % 'word' ORDER BY similarity DESC, w;
w | similarity ----------+---------------------- wordy | 0.5714285714285714 wofoord | 0.4444444444444444 worm | 0.42857142857142855 wort | 0.42857142857142855 world | 0.375 worried | 0.3 (6 rows) Time: 4ms total (execution 4ms / network 0ms)
Pattern matching with
LIKE
andILIKE
is also accelerated by a trigram index:EXPLAIN SELECT * FROM t WHERE w LIKE '%foo%';
info ------------------------------------------------------------------------------------------ distribution: local vectorized: true • filter │ estimated row count: 3,337 │ filter: w LIKE '%foo%' │ └── • index join │ estimated row count: 1,112 │ table: t@t_pkey │ └── • scan estimated row count: 1,112 (11% of the table; stats collected 3 minutes ago) table: t@t_w_idx spans: 1 span
Unsupported features
The following PostgreSQL syntax and features are currently unsupported. For details, see the tracking issue.
word_similarity()
built-in function.strict_word_similarity()
built-in function.%>
and<%
comparisons and acceleration.<<%
and%>>
comparisons and acceleration.<->
,<<->
,<->>
,<<<->
, and<->>>
comparisons.- Acceleration on regex string matching.
%
comparisons,show_trgm
, and trigram index creation on collated strings.