Ruby: Natural sort strings with Umlauts and other funny characters

Henning Koch
June 15, 2012
Software engineer at makandra GmbH

Why string sorting sucks in vanilla Ruby

Ruby's sort method compares strings byte by byte, so it doesn't work as expected with special characters (like German umlauts):

["Schwertner", "Schöler"].sort
# => ["Schwertner", "Schöler"] # you probably expected ["Schöler", "Schwertner"]

Numbers in strings are also sorted character by character, which you probably don't want:

["1", "2", "11"].sort
# => ["1", "11", "2"] # you probably expected ["1", "2", "11"]

Sorting is also case-sensitive, so all uppercase ASCII letters come before the lowercase ones:

["a", "B"].sort
# => ["B", "a"] # you probably expected ["a", "B"]
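
All three results have the same cause: String#<=> compares the raw bytes of the two strings. A short sketch of the byte values involved (the "\u00F6" escape is only used here to make the encoding of "ö" explicit, and the variable names are illustrative):

```ruby
# String#<=> compares byte by byte, so the order follows the byte values.
umlaut_bytes = "\u00F6".bytes # "ö" takes two bytes in UTF-8
w_bytes      = 'w'.bytes      # "w" is a single ASCII byte

# "Schö..." sorts after "Schw..." because the first byte of "ö" is larger than "w".
umlaut_after_w = umlaut_bytes.first > w_bytes.first

# "11" sorts before "2" because the comparison is decided at the first differing byte.
eleven_vs_two = '11' <=> '2'

# "B" sorts before "a" because uppercase ASCII letters have smaller byte values.
upper_before_lower = 'B'.ord < 'a'.ord
```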

How to fix it

To fix all of this, copy the attached files to config/initializers. They give your strings a method #to_sort_atoms that returns an object which compares as expected.

You can now say:

["Schwertner", "Schöler"].sort_by(&:to_sort_atoms) # => ["Schöler", "Schwertner"]
["1", "2", "11"].sort_by(&:to_sort_atoms) # => ["1", "2", "11"]
["a", "B"].sort_by(&:to_sort_atoms) # => ["a", "B"]
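
The attached files are not reproduced here. To illustrate the general idea behind such a sort key, here is a minimal sketch and not the actual implementation: the module NaturalSortSketch, its atoms method and the small transliteration table are all hypothetical. It assumes that a key is built by lower-casing the string, replacing umlauts with their base letters, and splitting the result into text and number chunks.

```ruby
# Hypothetical stand-in for illustration only, not the attached initializer.
module NaturalSortSketch
  # Assumed transliteration table, limited to German special characters.
  TRANSLITERATIONS = { 'ä' => 'a', 'ö' => 'o', 'ü' => 'u', 'ß' => 'ss' }.freeze

  module_function

  # Returns an array of [tag, value] pairs that can be compared with <=>.
  # Number chunks get tag 0 and text chunks get tag 1, so a number is never
  # compared with a string directly and digits sort before letters.
  def atoms(string)
    normalized = string.downcase.gsub(/[äöüß]/, TRANSLITERATIONS)
    normalized.scan(/\d+|\D+/).map do |chunk|
      chunk.match?(/\A\d/) ? [0, chunk.to_i] : [1, chunk]
    end
  end
end

["Schwertner", "Schöler"].sort_by { |s| NaturalSortSketch.atoms(s) } # => ["Schöler", "Schwertner"]
["1", "2", "11"].sort_by { |s| NaturalSortSketch.atoms(s) }          # => ["1", "2", "11"]
["a", "B"].sort_by { |s| NaturalSortSketch.atoms(s) }                # => ["a", "B"]
```

Tagging each chunk keeps Array#<=> from ever comparing an Integer with a String, which would return nil and make the sort fail.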

There is also a shortcut #natural_sort that does roughly the same as sort_by(&:to_sort_atoms):

["Schwertner", "Schöler"].natural_sort # => ["Schöler", "Schwertner"]
["1", "2", "11"].natural_sort # => ["1", "2", "11"]
["a", "B"].natural_sort # => ["a", "B"]

In addition, natural_sort will look for a method #to_sort_atoms on non-strings, so you can define your own natural sort order for other objects.

There is also natural_sort_by, which works like Ruby's sort_by(&block).
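
Both points can be sketched with a small example. The Release struct and its fields are hypothetical. The guarded definitions at the top are simplified stand-ins that only exist so the snippet runs on its own. They assume that natural_sort calls #to_sort_atoms on each element and that natural_sort_by calls it on the block's result. With the attached files loaded, the guards skip them and the real methods are used.

```ruby
# Simplified stand-ins (assumptions), skipped when the real methods exist.
unless String.method_defined?(:to_sort_atoms)
  class String
    # Lower-cased text chunks and integer chunks, tagged so that numbers
    # and text are never compared with each other directly.
    def to_sort_atoms
      downcase.scan(/\d+|\D+/).map { |c| c.match?(/\A\d/) ? [0, c.to_i] : [1, c] }
    end
  end
end

unless Array.method_defined?(:natural_sort)
  class Array
    def natural_sort
      sort_by(&:to_sort_atoms)
    end
  end
end

unless Array.method_defined?(:natural_sort_by)
  class Array
    def natural_sort_by
      sort_by { |element| yield(element).to_sort_atoms }
    end
  end
end

# Hypothetical non-string class that defines its own natural sort order
# by delegating to the sort atoms of its name.
Release = Struct.new(:name, :codename) do
  def to_sort_atoms
    name.to_sort_atoms
  end
end

releases = [
  Release.new('v10', 'beta'),
  Release.new('v2', 'Alpha'),
  Release.new('V1', 'gamma')
]

releases.natural_sort.map(&:name)                   # => ["V1", "v2", "v10"]
releases.natural_sort_by(&:codename).map(&:name)    # => ["v2", "v10", "V1"]
```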

Tweaking for weird requirements

You can configure the string normalization as described in "Normalize characters in Ruby".

Specs (for nerds)

Here are some specs that describe the behavior of #to_sort_atoms:

describe String do

  describe '#to_sort_atoms' do
    it 'should return an object that correctly compares German umlauts' do
      expect('Äa'.to_sort_atoms <=> 'Az'.to_sort_atoms).to eq -1
      expect('Äa'.to_sort_atoms <=> 'Ää'.to_sort_atoms).to eq 0
      expect('Az'.to_sort_atoms <=> 'Ää'.to_sort_atoms).to eq 1
    end

    it 'should return an object that compares case insensitively' do
      expect('A'.to_sort_atoms <=> 'b'.to_sort_atoms).to eq -1
      expect('A'.to_sort_atoms <=> 'a'.to_sort_atoms).to eq 0
      expect('B'.to_sort_atoms <=> 'a'.to_sort_atoms).to eq 1
    end

    it 'should return an object that compares naturally' do
      expect('2'.to_sort_atoms <=> '11'.to_sort_atoms).to eq -1
      expect('2'.to_sort_atoms <=> '2'.to_sort_atoms).to eq 0
      expect('11'.to_sort_atoms <=> '2'.to_sort_atoms).to eq 1
    end

    it 'should compare correctly when the left and right side have different number of atoms' do
      expect('a1b1c1'.to_sort_atoms <=> 'b1'.to_sort_atoms).to eq -1
      expect('a1b1c1'.to_sort_atoms <=> 'a1b1c1'.to_sort_atoms).to eq 0
      expect('b1'.to_sort_atoms <=> 'a1b1c1'.to_sort_atoms).to eq 1
    end

    it 'should compare correctly when the left side starts with a digit' do
      expect('1'.to_sort_atoms <=> 'a'.to_sort_atoms).to eq -1
      expect('1'.to_sort_atoms <=> '1'.to_sort_atoms).to eq 0
      expect('a'.to_sort_atoms <=> '1'.to_sort_atoms).to eq 1
    end
  end

end
Posted by Henning Koch to makandra dev (2012-06-15 14:32)