Trigram indexing as an alternative to PostgreSQL fulltext search
For searching in large database tables we usually use PostgreSQL's fulltext search capabilities.
While this works reasonably well for content primarily consisting of prose, it is not necessarily a good solution for all use cases. The main issue is that it is only possible to search for prefixes of text tokens, which can potentially be unexpected for users.
One example are dates:
If you index the text
2019-01-23 15:16, PostgreSQL will create the following tokens:
2019, -01, -23, 15 16. A user searching for
01-23 will get no results (
-01-23 would have worked).
Serial numbers (one case was
s7-1500) pose similar problems.
One possibility is to use plain old
SELECT * FROM contents WHERE search_text LIKE '%my-search-string%'
This has obvious performance problems for large datasets. You can mitigate those issues somewhat by using trigram indexes. These indexes work by collecting all combinations of 3 consecutive letters present in the source text. Trigram indexes are usually used for similarity search, but they do work for
LIKE queries as well.
To use them, in a migration, enable the PostgreSQL trigram extension and add a trigram index on your
class MoveToTrigramIndexing < ActiveRecord::Migration[5.2] def change enable_extension 'pg_trgm' # this requires superuser permissions for the database user add_index :contents, :search_test, using: :gin, opclass: :gin_trgm_ops end end
It is probably not a good idea to grant superuser capabilities to all database users. If you don't want to grant them, enable the extension manually with
sudo -u postgres psql -c "ENABLE EXTENSION IF NOT EXISTS pg_trgm" $DATABASE_NAME
PostgreSQL should now automatically start using the index for all
LIKE queries, when it judges this to be a performance improvement.
This will however never work for search terms of fewer than 3 characters. In those cases PostgreSQL will fall back to a full table scan.
In theory, the index should also be used for
ILIKE queries, and for queries using regular expressions, but I have not seen this working. I suspect that PostgreSQL judges the cost of using the index higher for
ILIKE and might start using it for even large data sets. I would still recommend to simply convert search texts and queries to lowercase and use