Ruby: Natural sort strings with Umlauts and other funny characters

Why string sorting sucks in vanilla Ruby

Ruby's sort method doesn't work as expected with special characters (like German umlauts):

["Schwertner", "Schöler"].sort
# => ["Schwertner", "Schöler"] # you probably expected ["Schöler", "Schwertner"]

Numbers inside strings are also compared character by character, which you probably don't want:

["1", "2", "11"].sort
# => ["1", "11", "2"] # you probably expected ["1", "2", "11"]

Sorting is also case-sensitive:

["a", "B"].sort
# => ["B", "a"] # you probably expected ["a", "B"]
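All three surprises come from the same cause: String#<=> compares the underlying bytes from left to right. A small sketch, assuming UTF-8 strings (Ruby's default source encoding), makes the byte values visible:

```ruby
# "ö" is encoded as two bytes in UTF-8, and the first one is larger than "w".
umlaut_bytes = "ö".bytes
w_byte       = "w".ord
p umlaut_bytes # => [195, 182]
p w_byte       # => 119, so "Schw..." sorts before "Schö..."

# Digits are compared one character at a time, not as numbers.
digit_comparison = ("11" <=> "2")
p digit_comparison # => -1, because "1" < "2" decides the comparison

# Uppercase ASCII letters have smaller byte values than lowercase ones.
upper_b = "B".ord
lower_a = "a".ord
p upper_b # => 66
p lower_a # => 97
```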

How to fix it

To fix all of this, copy the attached files to config/initializers. They give your strings a method #to_sort_atoms that returns an object that compares as expected.

You can now say:

["Schwertner", "Schöler"].sort_by(&:to_sort_atoms) # => ["Schöler", "Schwertner"]
["1", "2", "11"].sort_by(&:to_sort_atoms) # => ["1", "2", "11"]
["a", "B"].sort_by(&:to_sort_atoms) # => ["a", "B"]
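To give a rough idea of how such a comparable object could be built, here is a minimal sketch using only Ruby's standard String methods. It is not the attached implementation: the helper name sketch_sort_atoms is made up for this example, and stripping combining marks after Unicode decomposition is only one possible way to normalize umlauts.

```ruby
# Hypothetical stand-in for illustration; the attached initializer may work differently.
def sketch_sort_atoms(string)
  # Decompose "ö" into "o" + combining diaeresis, drop the combining marks,
  # and downcase so the comparison ignores case.
  normalized = string.unicode_normalize(:nfd).gsub(/\p{Mn}/, "").downcase

  # Split into runs of digits and runs of non-digits ("a10b" becomes "a", "10", "b").
  normalized.scan(/\d+|\D+/).map do |atom|
    # Tag each atom so Array#<=> never compares an Integer with a String.
    # The leading 0/1 also puts numbers before text.
    atom.match?(/\A\d+\z/) ? [0, atom.to_i] : [1, atom]
  end
end

sorted_names   = ["Schwertner", "Schöler"].sort_by { |s| sketch_sort_atoms(s) }
sorted_numbers = ["1", "2", "11"].sort_by { |s| sketch_sort_atoms(s) }
sorted_letters = ["a", "B"].sort_by { |s| sketch_sort_atoms(s) }

p sorted_names   # => ["Schöler", "Schwertner"]
p sorted_numbers # => ["1", "2", "11"]
p sorted_letters # => ["a", "B"]
```

Returning an array of tagged atoms lets Ruby's built-in Array#<=> do the element-by-element comparison, so no custom comparator class is needed in this sketch.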

There is also a shortcut #natural_sort that does roughly the same as sort_by(&:to_sort_atoms):

["Schwertner", "Schöler"].natural_sort # => ["Schöler", "Schwertner"]
["1", "2", "11"].natural_sort # => ["1", "2", "11"]
["a", "B"].natural_sort # => ["a", "B"]

In addition, natural_sort will look for a method #to_sort_atoms on non-strings, so you can define your own natural sort order.
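As an illustration, a custom class could expose its own sort order like this. The Room struct and its fields are hypothetical, and the sketch assumes that the object returned from #to_sort_atoms only needs to respond to <=>. To stay runnable without the initializer, it sorts with plain sort_by(&:to_sort_atoms).

```ruby
# Hypothetical value object; the name and fields are made up for this example.
Room = Struct.new(:floor, :number) do
  # Assumption: any object that responds to <=> is acceptable here.
  # An array of integers compares element by element, floor first.
  def to_sort_atoms
    [floor, number]
  end
end

rooms = [Room.new(2, 1), Room.new(1, 12), Room.new(1, 3)]
sorted_rooms = rooms.sort_by(&:to_sort_atoms)
room_labels = sorted_rooms.map { |room| "#{room.floor}.#{room.number}" }

p room_labels # => ["1.3", "1.12", "2.1"]
```

With the initializer loaded, rooms.natural_sort should presumably produce the same order, since it is described as looking up #to_sort_atoms on non-strings.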

There is also natural_sort_by which works like Ruby's sort_by(&block).

Tweaking for weird requirements

You can configure the string normalization as described in "Normalize characters in Ruby".

Specs (for nerds)

Here are some specs that describe the behavior of #to_sort_atoms:

describe String do

  describe '#to_sort_atoms' do
    it 'should return an object that correctly compares German umlauts' do
      expect('Äa'.to_sort_atoms <=> 'Az'.to_sort_atoms).to eq(-1)
      expect('Äa'.to_sort_atoms <=> 'Ää'.to_sort_atoms).to eq(0)
      expect('Az'.to_sort_atoms <=> 'Ää'.to_sort_atoms).to eq(1)
    end

    it 'should return an object that compares case insensitively' do
      expect('A'.to_sort_atoms <=> 'b'.to_sort_atoms).to eq(-1)
      expect('A'.to_sort_atoms <=> 'a'.to_sort_atoms).to eq(0)
      expect('B'.to_sort_atoms <=> 'a'.to_sort_atoms).to eq(1)
    end

    it 'should return an object that compares naturally' do
      expect('2'.to_sort_atoms <=> '11'.to_sort_atoms).to eq(-1)
      expect('2'.to_sort_atoms <=> '2'.to_sort_atoms).to eq(0)
      expect('11'.to_sort_atoms <=> '2'.to_sort_atoms).to eq(1)
    end

    it 'should compare correctly when the left and right side have different number of atoms' do
      expect('a1b1c1'.to_sort_atoms <=> 'b1'.to_sort_atoms).to eq(-1)
      expect('a1b1c1'.to_sort_atoms <=> 'a1b1c1'.to_sort_atoms).to eq(0)
      expect('b1'.to_sort_atoms <=> 'a1b1c1'.to_sort_atoms).to eq(1)
    end

    it 'should compare correctly when the left side starts with a digit' do
      expect('1'.to_sort_atoms <=> 'a'.to_sort_atoms).to eq(-1)
      expect('1'.to_sort_atoms <=> '1'.to_sort_atoms).to eq(0)
      expect('a'.to_sort_atoms <=> '1'.to_sort_atoms).to eq(1)
    end
  end

end

Henning Koch
Last edit: Fabian Schwarz
Source code in this card is licensed under the MIT License.
Posted by Henning Koch to makandra dev (2012-06-15 12:32)