Read more

Ruby: Natural sort strings with Umlauts and other funny characters

Avatar
Henning Koch
June 15, 2012Software engineer at makandra GmbH

Why string sorting sucks in vanilla Ruby

Ruby's sort method doesn't work as expected with German umlauts:

["Schwertner", "Schöler"].sort
=> ["Schwertner", "Schöler"] # you probably expected ["Schöler", "Schwertner"]
Illustration web development

Do you need DevOps-experts?

Your development team has a full backlog? No time for infrastructure architecture? Our DevOps team is ready to support you!

  • We build reliable cloud solutions with Infrastructure as code
  • We are experts in security, Linux and databases
  • We support your dev team to perform
Read more

Also numbers in strings will be sorted character by character which you probably don't want:

["1", "2", "11"].sort
# => ["1", "11", "2"] # you probably expected ["1", "2", "11"]

Also the sorting is case sensitive:

["a", "B"].sort
# => ["B", "a"] # you probably expected ["a", "B"]

How to fix it

To fix all of this copy the attached files to config/initializers. It gives your strings a method #to_sort_atoms that returns an object that compares as expected.

You can now say:

["Schwertner", "Schöler"].sort_by(&:to_sort_atoms) #=> ["Schöler", "Schwertner"]
["1", "2", "11"].sort_by(&:to_sort_atoms) # => ["1", "2", "11"]
["a", "B"].sort_by(&:to_sort_atoms) # => ["a", "B"]

There is also a shortcut #natural_sort that does roughly the same as sort_by(&:to_sort_atoms):

["Schwertner", "Schöler"].natural_sort #=> ["Schöler", "Schwertner"]
["1", "2", "11"].natural_sort # => ["1", "2", "11"]
["a", "B"].natural_sort # => ["a", "B"]

In additional natural_sort will look for a method #to_sort_atoms on non-strings so you can define your own natural sort order.

There is also natural_sort_by which works like Ruby's sort_by(&block).

Tweaking for weird requirements

You can configure the string normalization as described in "Normalize characters in Ruby".

Specs (for nerds)

Here are some specs that describe the behavior of #to_sort_atoms:

describe String do

  describe '#to_sort_atoms' do
    it 'should return an object that correctly compares German umlauts' do
      expect('Äa'.to_sort_atoms <=> 'Az'.to_sort_atoms).to eq -1
      expect('Äa'.to_sort_atoms <=> 'Ää'.to_sort_atoms).to eq 0
      expect('Az'.to_sort_atoms <=> 'Ää'.to_sort_atoms).to eq 1
    end

    it 'should return an object that compares case insensitively' do
      expect('A'.to_sort_atoms <=> 'b'.to_sort_atoms).to eq -1
      expect('A'.to_sort_atoms <=> 'a'.to_sort_atoms).to eq 0
      expect('B'.to_sort_atoms <=> 'a'.to_sort_atoms).to eq 1
    end

    it 'should return an object that compares naturally' do
      expect('2'.to_sort_atoms <=> '11'.to_sort_atoms).to eq -1
      expect('2'.to_sort_atoms <=> '2'.to_sort_atoms).to eq 0
      expect('11'.to_sort_atoms <=> '2'.to_sort_atoms).to eq 1
    end

    it 'should compare correctly when the left and right side have different number of atoms' do
      expect('a1b1c1'.to_sort_atoms <=> 'b1'.to_sort_atoms).to eq -1
      expect('a1b1c1'.to_sort_atoms <=> 'a1b1c1'.to_sort_atoms).to eq 0
      expect('b1'.to_sort_atoms <=> 'a1b1c1'.to_sort_atoms).to eq 1
    end

    it 'should compare correctly when the left side starts with a digit' do
      expect('1'.to_sort_atoms <=> 'a'.to_sort_atoms).to eq -1
      expect('1'.to_sort_atoms <=> '1'.to_sort_atoms).to eq 0
      expect('a'.to_sort_atoms <=> '1'.to_sort_atoms).to eq 1
    end
  end

end
Avatar
Henning Koch
June 15, 2012Software engineer at makandra GmbH
Posted by Henning Koch to makandra dev (2012-06-15 14:32)