Ruby: Natural sort strings with Umlauts and other funny characters


Why string sorting sucks in vanilla Ruby

Ruby's sort method doesn't work as expected with special characters (like German umlauts):

["Schwertner", "Schöler"].sort
# => ["Schwertner", "Schöler"] # you probably expected ["Schöler", "Schwertner"]

Numbers in strings are also sorted character by character, which you probably don't want:

["1", "2", "11"].sort
# => ["1", "11", "2"] # you probably expected ["1", "2", "11"]

Sorting is also case-sensitive:

["a", "B"].sort
# => ["B", "a"] # you probably expected ["a", "B"]
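All three surprises share one cause: String#<=> effectively compares character codes from left to right and knows nothing about language or numeric value. A quick check in plain Ruby (no extra files assumed; the variable name is only for illustration) makes this visible:

```ruby
# Plain Ruby, no extensions: look at the code point of each character involved.
code_points = %w[w ö B a 1 2].map { |char| [char, char.ord] }.to_h
# "ö" (246) ranks above "w" (119), "B" (66) below "a" (97), "1" (49) below "2" (50).
puts code_points.inspect

puts(("Schwertner" <=> "Schöler").inspect) # -1, because "w" ranks below "ö"
puts(("11" <=> "2").inspect)               # -1, decided at the first character
puts(("B" <=> "a").inspect)                # -1, uppercase letters rank first
```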

How to fix it

To fix all of this, copy the attached files to config/initializers. They give your strings a method #to_sort_atoms that returns an object that compares as expected.

You can now say:

["Schwertner", "Schöler"].sort_by(&:to_sort_atoms) # => ["Schöler", "Schwertner"]
["1", "2", "11"].sort_by(&:to_sort_atoms) # => ["1", "2", "11"]
["a", "B"].sort_by(&:to_sort_atoms) # => ["a", "B"]
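The attached files are not reproduced in this card. As a rough idea of how such a comparable object could be built, here is a minimal sketch. The module NaturalSortSketch, its Atom class and the tiny replacement table are all assumptions made for illustration and are not the actual implementation: the string is lowercased, umlauts are replaced, and the result is split into runs of digits and non-digits that compare pairwise.

```ruby
# Hypothetical sketch; NaturalSortSketch and everything in it are illustrative
# names, not the contents of the attached files.
module NaturalSortSketch
  # Assumed minimal mapping; a real setup would use a fuller normalization.
  REPLACEMENTS = { 'ä' => 'a', 'ö' => 'o', 'ü' => 'u', 'ß' => 'ss' }.freeze

  # One chunk of a string: an Integer for a digit run, a String otherwise.
  class Atom
    include Comparable
    attr_reader :value

    def initialize(value)
      @value = value
    end

    # Numbers compare numerically and sort before text; text compares as usual.
    def <=>(other)
      left_numeric = value.is_a?(Integer)
      right_numeric = other.value.is_a?(Integer)
      return value <=> other.value if left_numeric == right_numeric

      left_numeric ? -1 : 1
    end
  end

  # Lowercases, replaces umlauts, then splits into digit and non-digit runs.
  def self.atoms(string)
    normalized = string.downcase.gsub(/[äöüß]/, REPLACEMENTS)
    normalized.scan(/\d+|\D+/).map do |chunk|
      Atom.new(chunk.match?(/\A\d+\z/) ? chunk.to_i : chunk)
    end
  end
end

sorted_names   = ["Schwertner", "Schöler"].sort_by { |s| NaturalSortSketch.atoms(s) }
sorted_numbers = ["1", "2", "11"].sort_by { |s| NaturalSortSketch.atoms(s) }
sorted_letters = ["a", "B"].sort_by { |s| NaturalSortSketch.atoms(s) }
```

Returning an array of atoms keeps the sketch short, because Array#<=> already compares element by element and treats a shorter array with an equal prefix as smaller.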

There is also a shortcut #natural_sort that does roughly the same as sort_by(&:to_sort_atoms):

["Schwertner", "Schöler"].natural_sort # => ["Schöler", "Schwertner"]
["1", "2", "11"].natural_sort # => ["1", "2", "11"]
["a", "B"].natural_sort # => ["a", "B"]

In addition, natural_sort will look for a method #to_sort_atoms on non-strings, so you can define your own natural sort order.

There is also natural_sort_by, which works like Ruby's sort_by(&block).
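To illustrate a custom sort order on a non-string, here is a small sketch. So that it runs on its own, it defines simplified stand-ins for natural_sort and natural_sort_by on Array; these stand-ins are an assumption about how the attached files behave, and DottedVersion and Release are hypothetical classes invented for the example.

```ruby
# Simplified stand-ins so this example runs outside the app
# (assumption: the real methods behave roughly like this).
class Array
  def natural_sort
    sort_by(&:to_sort_atoms)
  end

  def natural_sort_by
    sort_by { |element| yield(element).to_sort_atoms }
  end
end

# Hypothetical value object: a dotted version number such as "1.10.0".
DottedVersion = Struct.new(:string) do
  # Defines the natural order: compare the numeric segments one by one.
  def to_sort_atoms
    string.split('.').map(&:to_i)
  end
end

# Hypothetical record that carries a version.
Release = Struct.new(:name, :version)

versions = [DottedVersion.new('1.10.0'), DottedVersion.new('1.9.3'), DottedVersion.new('1.2.11')]
sorted_versions = versions.natural_sort.map(&:string)

releases = [
  Release.new('newer', DottedVersion.new('2.0')),
  Release.new('older', DottedVersion.new('1.12'))
]
sorted_release_names = releases.natural_sort_by(&:version).map(&:name)

puts sorted_versions.inspect
puts sorted_release_names.inspect
```

Here sorted_versions comes out as 1.2.11, 1.9.3, 1.10.0, and the releases are ordered by their version rather than by name.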

Tweaking for weird requirements

You can configure the string normalization as described in "Normalize characters in Ruby".

Specs (for nerds)

Here are some specs that describe the behavior of #to_sort_atoms:

describe String do

  describe '#to_sort_atoms' do
    it 'should return an object that correctly compares German umlauts' do
      expect('Äa'.to_sort_atoms <=> 'Az'.to_sort_atoms).to eq(-1)
      expect('Äa'.to_sort_atoms <=> 'Ää'.to_sort_atoms).to eq 0
      expect('Az'.to_sort_atoms <=> 'Ää'.to_sort_atoms).to eq 1
    end

    it 'should return an object that compares case insensitively' do
      expect('A'.to_sort_atoms <=> 'b'.to_sort_atoms).to eq(-1)
      expect('A'.to_sort_atoms <=> 'a'.to_sort_atoms).to eq 0
      expect('B'.to_sort_atoms <=> 'a'.to_sort_atoms).to eq 1
    end

    it 'should return an object that compares naturally' do
      expect('2'.to_sort_atoms <=> '11'.to_sort_atoms).to eq(-1)
      expect('2'.to_sort_atoms <=> '2'.to_sort_atoms).to eq 0
      expect('11'.to_sort_atoms <=> '2'.to_sort_atoms).to eq 1
    end

    it 'should compare correctly when the left and right side have different number of atoms' do
      expect('a1b1c1'.to_sort_atoms <=> 'b1'.to_sort_atoms).to eq(-1)
      expect('a1b1c1'.to_sort_atoms <=> 'a1b1c1'.to_sort_atoms).to eq 0
      expect('b1'.to_sort_atoms <=> 'a1b1c1'.to_sort_atoms).to eq 1
    end

    it 'should compare correctly when the left side starts with a digit' do
      expect('1'.to_sort_atoms <=> 'a'.to_sort_atoms).to eq(-1)
      expect('1'.to_sort_atoms <=> '1'.to_sort_atoms).to eq 0
      expect('a'.to_sort_atoms <=> '1'.to_sort_atoms).to eq 1
    end
  end

end

Last edit by Fabian Schwarz
Keywords: Umlaute
License: Source code in this card is licensed under the MIT License.
Posted by Henning Koch to makandra dev (2012-06-15 12:32)