Trigram indexing as an alternative to PostgreSQL fulltext search

Updated . Posted . Visible to the public.

For searching in large database tables we usually use PostgreSQL's fulltext search capabilities.

While this works reasonably well for content primarily consisting of prose, it is not necessarily a good solution for all use cases. The main issue is that it is only possible to search for prefixes of text tokens, which can potentially be unexpected for users.

One example are dates:

If you index the text 2019-01-23 15:16, PostgreSQL will create the following tokens: 2019, -01, -23, 15 16. A user searching for 01-23 will get no results (-01-23 would have worked).

Serial numbers (one case was s7-1500) pose similar problems.

Potential solution

One possibility is to use plain old LIKE queries:

SELECT * FROM contents WHERE search_text LIKE '%my-search-string%'

This has obvious performance problems for large datasets. You can mitigate those issues somewhat by using trigram indexes. These indexes work by collecting all combinations of 3 consecutive letters present in the source text. Trigram indexes are usually used for similarity search, but they do work for LIKE queries as well.

To use them, in a migration, enable the PostgreSQL trigram extension and add a trigram index on your search_text column:

class MoveToTrigramIndexing < ActiveRecord::Migration[5.2]
  def change
    enable_extension 'pg_trgm' # this requires superuser permissions for the database user
    add_index :contents, :search_test, using: :gin, opclass: :gin_trgm_ops
  end
end

It is probably not a good idea to grant superuser capabilities to all database users. If you don't want to grant them, enable the extension manually with

sudo -u postgres psql -c "ENABLE EXTENSION IF NOT EXISTS pg_trgm" $DATABASE_NAME

PostgreSQL should now automatically start using the index for all LIKE queries, when it judges this to be a performance improvement.
This will however never work for search terms of fewer than 3 characters. In those cases PostgreSQL will fall back to a full table scan.

Important

In theory, the index should also be used for ILIKE queries, and for queries using regular expressions, but I have not seen this working. I suspect that PostgreSQL judges the cost of using the index higher for ILIKE and might start using it for even large data sets. I would still recommend to simply convert search texts and queries to lowercase and use LIKE instead.

Tobias Kraze
Last edit
Michael Leimstädtner
License
Source code in this card is licensed under the MIT License.
Posted by Tobias Kraze to makandra dev (2019-01-23 14:10)