For searching in large database tables we usually use PostgreSQL's fulltext search capabilities.
While this works reasonably well for content primarily consisting of prose, it is not necessarily a good solution for all use cases. The main issue is that it is only possible to search for prefixes of text tokens, which can potentially be unexpected for users.
One example are dates:
If you index the text 2019-01-23 15:16
, PostgreSQL will create the following tokens: 2019, -01, -23, 15 16
. A user searching for 01-23
will get no results (-01-23
would have worked).
Serial numbers (one case was s7-1500
) pose similar problems.
Potential solution
One possibility is to use plain old LIKE
queries:
SELECT * FROM contents WHERE search_text LIKE '%my-search-string%'
This has obvious performance problems for large datasets. You can mitigate those issues somewhat by using trigram indexes. These indexes work by collecting all combinations of 3 consecutive letters present in the source text. Trigram indexes are usually used for similarity search, but they do work for LIKE
queries as well.
To use them, in a migration, enable the PostgreSQL trigram extension and add a trigram index on your search_text
column:
class MoveToTrigramIndexing < ActiveRecord::Migration[5.2]
def change
enable_extension 'pg_trgm' # this requires superuser permissions for the database user
add_index :contents, :search_test, using: :gin, opclass: :gin_trgm_ops
end
end
It is probably not a good idea to grant superuser capabilities to all database users. If you don't want to grant them, enable the extension manually with
sudo -u postgres psql -c "ENABLE EXTENSION IF NOT EXISTS pg_trgm" $DATABASE_NAME
PostgreSQL should now automatically start using the index for all LIKE
queries, when it judges this to be a performance improvement.
This will however never work for search terms of fewer than 3 characters. In those cases PostgreSQL will fall back to a full table scan.
Important
In theory, the index should also be used for
ILIKE
queries, and for queries using regular expressions, but I have not seen this working. I suspect that PostgreSQL judges the cost of using the index higher forILIKE
and might start using it for even large data sets. I would still recommend to simply convert search texts and queries to lowercase and useLIKE
instead.