Postgres is a pretty powerful tool, and it offers a slew of useful features beyond mere data persistence and query execution. One such useful feature is string matching the capabilities of which are available both as extension e.g.
fuzzy string matching
Show archive.org snapshot
. You can for instance pass a regular expression, that a particular model attribute is to be matched against directly to the database query, for instance if I want all users of a certain email domain to assign them to a newly created Organization
model. I can simply execute this query:
members = User.where("email ~* '.*@9elements.com$'")
Not only is this much more performant than completing the same task in Ruby in memory, it allows us to push the data crunching logic that would otherwise litter our models with Regexes and matching conditions into the database layer.
If we want to get more fancy, we can make use of Postgres' fuzzystrmatch extension, which can be enabled via the following migration:
class AddFuzzyStringMatching < ActiveRecord::Migration
def change
enable_extension "fuzzystrmatch"
end
end
This gives us access to fuzzy matching functions such as the
Levenshtein distance
Show archive.org snapshot
, which makes it possible to handle more complex logic at the database level.
Suppose for instance that we have a database full of user generated content, such as restaurant ratings, that associates a rating with a particular restaurant, whose name is given by the user in plain text.
We would like to match this input to an authoritative source of restaurant names which we store in the restaurants
database in order to perform subsequent processing by either 'mechanical turks' or advanced text analysis.
We can use the postgres levenshtein
functions to retrieve groups of similar restaurant names from user input for each 'ground truth' restaurant name:
SELECT rest.name AS restaurant_name, array_agg(cand_rest.name)
FROM restaurant rest
JOIN candidate_restaurant cand_rest ON levensthein(rest.name, cand_rest.name) < 4
GROUP BY restaurant_name;
array_agg
is a postgres aggregation similar to sum
or avg
, that work with GROUP BY
.
Here we are retrieving all restaurant names together with a list of names from the user input table which are within 4 edit operations (deletion, insertion, substitution) of the respective, correct restaurant name.
We can execute raw sql queries like this against the database using ActiveRecord connection:
@connection = ActiveRecord::Base.connection
result = @connection.exec_query('SELECT rest.name AS restaurant_name...')
Alternatively, we could even use the levenshtein
function directly in an ActiveRecord where clause to retrieve matching restaurant names:
candidate_name = "Ausora"
restaurants = Restaurant.where('levenshtein(name, ?) < 4', candidate_name)