PostgreSQL can cosplay as a full-text search engine. It doesn't have the features or fidelity of ElasticSearch or Algolia, but it's good enough if you just need to search and rank large volumes of text.
This card will teach you how to index, search and rank your Rails models in a PostgreSQL full-text index. We will do this without using any gems aside from ActiveRecord. While there are gems like pg_search Show archive.org snapshot or pg_fulltext Show archive.org snapshot , manual integration requires very little code. And since you are inevitably going to tweak your search implementation, you may find it easier to do so directly in your own code.
For the examples below we will search a simple Message
model. It stores messages with a subject line and the message body:
class CreateMessage < ActiveRecord::Migration[7.2]
def change
create_table :messages do |t|
t.string :subject
t.text :body
t.timestamps
end
end
end
class Message < ApplicationRecord
end
This card will start with a very basic search example. We will then gradually improve our implementation with additional requirements you will probably encounter. A good companion to this card is the Controlling Text Search Show archive.org snapshot reference from the PostgreSQL documentation.
Basic example
We want to implement a Message.search
method that looks for a given query
in the body
column. It should return a scope (relation) with all matches:
Message.search('quarterly results') # => #<ActiveRecord::Relation [...]>
This is all we need to implement this method:
class Message < ApplicationRecord
scope :search, ->(query) { where(<<~SQL, query: query) }
to_tsvector('simple', messages.body) @@ websearch_to_tsquery('simple', :query)
SQL
end
This implementation only finds matching records, without a particular order. We will learn how to rank results later.
The string 'simple'
denotes the PostgreSQL dictionary for tokenization and normalization. simple
is a very basic dictionary that only lower-cases words and removes some stop words. We will use more advanced dictionaries when we implement stemming later.
Google-like queries are supported by default
The websearch_to_tsquery
function is provided by PostgreSQL and supports Google-like query grammar:
Query | Result |
---|---|
Message.search('foo bar') |
Both foo and bar must match somewhere |
Message.search('foo OR bar') |
Either foo or bar must match somewhere |
Message.search('"foo bar"') |
The phrase foo bar must match in that exact order |
Message.search('foo -bar') |
foo must match, but bar must not |
Searching in multiple columns
To search more than one column, we can apply the @@
operator multiple times and OR
the results:
class Message < ApplicationRecord
scope :search, ->(query) { where(<<~SQL, query: query) }
(to_tsvector('simple', messages.subject) @@ websearch_to_tsquery('simple', :query)) OR
(to_tsvector('simple', messages.body) @@ websearch_to_tsquery('simple', :query))
SQL
end
Improving performance by caching vectors
The SQL queries we used so far are inefficient. We tokenize and normalize every column value (messages.body
) into a tsvector
, use that vector for a single comparison, and then discard it. The next query will recreate that vector, even if the underlying data hasn't changed.
A more performant approach would be to cache the vector in a separate column, and update it when the underlying data changes. That way all queries can run on the same, cached vector value. We can even index the vector cache to further improve performance.
To implement this, we create a new column search_tsvector
and create a GIN
index over it:
class AddSearchTsvectorToMessage < ActiveRecord::Migration[7.2]
def up
add_column :messages, :search_tsvector, :tsvector
add_index :messages, :search_tsvector, using: :gin
end
def down
remove_column :messages, :search_tsvector
end
end
We now adjust our model so it updates the #search_tsvector
attribute when subject
or body
changes. We also change our Message.search
method so it runs the query against the cached vector value:
class Message < ApplicationRecord
scope :search, ->(query) { where(<<~SQL, query: query) }
search_tsvector @@ websearch_to_tsquery('simple', :query)
SQL
after_save :index_for_search
def index_for_search
return unless saved_change_to_subject? || saved_change_to_body?
Message.where(id: id).update_all(<<~SQL)
search_tsvector =
to_tsvector('simple', messages.subject) ||
to_tsvector('simple', messages.body)
SQL
end
end
Ranking results
So far our Message.search
method has returned all matching messages, in no particular order. We often want to promote results where words match more often, or where matching words are closer to each other. We can do so by ordering against the ts_rank_cd
function:
class Message < ApplicationRecord
scope :search, ->(query) {
query_sql = sanitize_sql(["websearch_to_tsquery('simple', ?)", query])
self
.where(Arel.sql("search_tsvector @@ #{query_sql}"))
.order(Arel.sql("ts_rank_cd(search_tsvector, #{query_sql}) DESC"))
}
# ...
end
Normalizing long documents
Because longer documents are more likely to contain a search term by chance, it can make sense to consider document length. For this the ts_rank_cd
function takes a third argument with a normalization setting.
For example, a normalization of 1
will divide the rank by 1 + the logarithm of the document length:
class Message < ApplicationRecord
scope :search, ->(query) {
query_sql = sanitize_sql(["websearch_to_tsquery('simple', ?)", query])
self
.where(Arel.sql("search_tsvector @@ #{query_sql}"))
.order(Arel.sql("ts_rank_cd(search_tsvector, #{query_sql}, 1) DESC"))
}
# ...
end
There are many normalization strategies. See the PostgreSQL documentation Show archive.org snapshot for a list.
Weighing columns differently
When we search multiple columns, we sometimes want to weigh one attribute higher than another. That weight affects how much a match contributes to the document's rank.
PostgreSQL offers four categories which (by default) have the following ranking weights:
Category | Weight |
---|---|
A |
1.0 |
B |
0.4 |
C |
0.2 |
D |
0.1 |
We can set these weights when we create the tsvector
value:
class Message < ApplicationRecord
# ...
def index_for_search
return unless saved_change_to_subject? || saved_change_to_body?
Message.where(id: id).update_all(<<~SQL)
search_tsvector =
setweight(to_tsvector('simple', messages.subject), 'A') ||
setweight(to_tsvector('simple', messages.body), 'B')
SQL
end
end
Boosting or penalizing records
Sometimes you want the rank to reflect additional criteria than just the frequency of word matches. In our example we may want to higher rank messages from people in our address book, or boost messages that we received recently.
We can influence the ranking value by multiplying or adding to the return value of the ts_rank_cd()
function. The following would penalize old messages by reducing the rank for each day since the message was received:
class Message < ApplicationRecord
scope :search, ->(query) {
query_sql = sanitize_sql(["websearch_to_tsquery('simple', ?)", query])
penalize_age = '0.0002 * EXTRACT(DAY FROM (now() - messages.created_at))'
self
.where(Arel.sql("search_tsvector @@ #{query_sql}"))
.order(Arel.sql("ts_rank_cd(search_tsvector, #{query_sql}) - #{penalize_age} DESC"))
}
# ...
end
Stemming
Stemming will reduce inflected words to their word stem. E.g. the words "fishing", "fished", and "fisher" can all be reduced to the stem "fish". Stemming can make our search more forgiving, as small variations in word usage can still yield a match.
So far we have used the simple
dictionary, which only applies basic normalization by lower-casing words. Instead of simple
we can use any available
PostgreSQL dictionary
Show archive.org snapshot
, or even create our own.
For example, the english
dictionary provides stemming for English text and also removes English stopwords:
class Message < ApplicationRecord
scope :search, ->(query) { where(<<~SQL, query: query) }
search_tsvector @@ websearch_to_tsquery('english', :query)
SQL
after_save :index_for_search
def index_for_search
return unless saved_change_to_subject? || saved_change_to_body?
Message.where(id: id).update_all(<<~SQL)
search_tsvector =
to_tsvector('english', messages.subject) ||
to_tsvector('english', messages.body)
SQL
end
end