String matching in Postgres

Updated . Posted . Visible to the public.

Postgres is a pretty powerful tool, and it offers a slew of useful features beyond mere data persistence and query execution. One such useful feature is string matching the capabilities of which are available both as extension e.g. fuzzy string matching Show archive.org snapshot . You can for instance pass a regular expression, that a particular model attribute is to be matched against directly to the database query, for instance if I want all users of a certain email domain to assign them to a newly created Organization model. I can simply execute this query:

members = User.where("email ~* '.*@9elements.com$'")

Not only is this much more performant than completing the same task in Ruby in memory, it allows us to push the data crunching logic that would otherwise litter our models with Regexes and matching conditions into the database layer.

If we want to get more fancy, we can make use of Postgres' fuzzystrmatch extension, which can be enabled via the following migration:

class AddFuzzyStringMatching < ActiveRecord::Migration
  def change
    enable_extension "fuzzystrmatch"
  end
end

This gives us access to fuzzy matching functions such as the Levenshtein distance Show archive.org snapshot , which makes it possible to handle more complex logic at the database level.
Suppose for instance that we have a database full of user generated content, such as restaurant ratings, that associates a rating with a particular restaurant, whose name is given by the user in plain text.
We would like to match this input to an authoritative source of restaurant names which we store in the restaurants database in order to perform subsequent processing by either 'mechanical turks' or advanced text analysis.
We can use the postgres levenshtein functions to retrieve groups of similar restaurant names from user input for each 'ground truth' restaurant name:

SELECT rest.name AS restaurant_name, array_agg(cand_rest.name)
FROM restaurant rest
JOIN candidate_restaurant cand_rest ON levensthein(rest.name, cand_rest.name) < 4
GROUP BY restaurant_name;

array_agg is a postgres aggregation similar to sum or avg, that work with GROUP BY.
Here we are retrieving all restaurant names together with a list of names from the user input table which are within 4 edit operations (deletion, insertion, substitution) of the respective, correct restaurant name.

We can execute raw sql queries like this against the database using ActiveRecord connection:

@connection = ActiveRecord::Base.connection
result = @connection.exec_query('SELECT rest.name AS restaurant_name...')

Alternatively, we could even use the levenshtein function directly in an ActiveRecord where clause to retrieve matching restaurant names:

candidate_name = "Ausora"
restaurants = Restaurant.where('levenshtein(name, ?) < 4', candidate_name)
Jan Bussieck
Last edit
Jan Bussieck
Posted by Jan Bussieck to Speqtor (2015-06-04 13:50)