Postgres: DISTINCT ON lets you select only one record per ordered attribute(s) for each group

  • To retrieve only unique combinations of the selected attributes: You can omit rows, where all selected columns are equal with the DISTINCT statement.
  • To retrieve the group wise maximum of certain columns: You can keep only one record for each group with the DISTINCT ON statement, to omit equal rows within each specified group.

Use case

You have a query where you want only one record for a set of specifically ordered attributes.

How to use?

Let's say we look at the example how to query only the latest post for each user:

SELECT DISTINCT ON (user_id) posts.*
FROM posts 
ORDER BY user_id, published_at DESC;

In Ruby and ActiveRecord it is just:

Post.select('DISTINCT ON (user_id) *').order(:user_id, published_at: :desc)

How does it work?

Internally this will produce unique groups on user id and take only the latest published post for each group.

As pseudo sql this could also be written as:

SELECT *
FROM (
  SELECT user_id, posts.*
  FROM posts
  ORDER BY user_id, published_at
) p
GROUP BY user_id

Be aware that even though you can find a lot of examples like the above on the web, it may or may not be valid sql Show archive.org snapshot ! The reason for this is, that many of the modern sql optimizers will not take the ordering of subqueries into account for the actual execution.

Limitations

Performance

If performance is critical because you are dealing with large records for each group, you have to consider something else like using group by with multicolumn indices. This stackoverflow answer Show archive.org snapshot directs also to some alternatives.

Ordering

When you have to order additionally by another column as the one you have set for DISTINCT ON, you have to make sure that the arguments in ORDER BY have to be the same as the first arguments in any order in DISTINCT ON and add any further ordered attributes thereafter:

SELECT DISTINCT ON (user_id, id) posts.*
FROM posts 
ORDER  BY user_id, id, created_at DESC;

But if you want the complete result to be ordered by completely different column you have to wrap the distinct query as a subquery Show archive.org snapshot .

In SQL it would look like

SELECT *
FROM (
  SELECT DISTINCT ON (user_id) posts.*  
  FROM posts 
  ORDER BY user_id, published_at DESC
) post
ORDER  BY category DESC, created_at;

The easiest options to achieve this in Rails ActiveRecord is to use "WHERE ID IN" with

User.where(id: Post.distinct_on_user.pluck(:user_id)).order(...)

Further ressources

Felix Eschey Over 1 year ago