
PSA: Umlauts are not always what they seem to be

Arne Hartherz
June 03, 2014
Software engineer at makandra GmbH

When a string contains umlauts that don't behave as expected (they are not matched by a regexp, can't be found with an SQL query, do not print correctly in LaTeX documents, etc.), you may be dealing with umlauts that are not actually single umlaut characters.

Depending on the font, they look just like their "real" umlaut counterparts:

  • ä ↔ ä
  • ö ↔ ö
  • ü ↔ ü

However, they are not the same:

>> 'ä' == 'ä'
=> false
>> 'ä'.size
=> 1
>> 'ä'.size
=> 2

Looking at how those strings are constructed reveals what is going on:

>> 'ä'.unpack('U*')
=> [228]
>> 'ä'.unpack('U*')
=> [97, 776]

As you can see, the second representation is actually a combination of "a" (character code 97) and a combining Unicode character (character code 776, U+0308 COMBINING DIAERESIS) which puts the dots on top of it.

This is terrible.
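
Because copying and pasting the examples above may silently convert one form into the other, here is a small sketch that builds both variants from explicit escape sequences. The helper name `code_points` is made up for illustration; it only wraps the `unpack('U*')` call used above.

```ruby
# Hypothetical helper: returns the Unicode code points of a string as
# decimal integers, mirroring the unpack('U*') calls above.
def code_points(string)
  string.unpack('U*')
end

precomposed = "\u00E4"   # one code point: LATIN SMALL LETTER A WITH DIAERESIS
decomposed  = "a\u0308"  # two code points: "a" + COMBINING DIAERESIS

p code_points(precomposed)   # [228]
p code_points(decomposed)    # [97, 776]
p precomposed == decomposed  # false
p precomposed.size           # 1
p decomposed.size            # 2
```

Writing test data with escapes like this keeps the two forms distinguishable no matter what an editor or clipboard does to the text.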

Ruby 2.2 is your friend

Ruby 2.2 brings String#unicode_normalize and its bang and query siblings, String#unicode_normalize! and String#unicode_normalized?.

2.2.0 :001 > 'ä' == 'ä'
 => false 
2.2.0 :002 > 'ä' == 'ä'.unicode_normalize
 => true 
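
As a rough sketch of how this could be used in practice (assuming Ruby 2.2 or newer), the snippet below normalizes strings before comparing or matching them. The helper `same_text?` is a hypothetical name, not part of Ruby or of the original card. NFC (composed) is the default form; `:nfd` produces the decomposed form.

```ruby
# Hypothetical helper: compares two strings after bringing both into NFC,
# the default normalization form of String#unicode_normalize.
def same_text?(a, b)
  a.unicode_normalize(:nfc) == b.unicode_normalize(:nfc)
end

precomposed = "\u00E4"   # single code point
decomposed  = "a\u0308"  # "a" + COMBINING DIAERESIS

p same_text?(precomposed, decomposed)                  # true
p decomposed.unicode_normalize == precomposed          # true (NFC is the default)
p precomposed.unicode_normalize(:nfd) == decomposed    # true
p decomposed.unicode_normalized?                       # false (checks NFC by default)
p precomposed.unicode_normalized?                      # true

# A regexp written with the precomposed character only matches
# once the input has been normalized.
input = "Ba\u0308r"
p input =~ /B\u00E4r/                      # nil
p input.unicode_normalize =~ /B\u00E4r/    # 0
```

Normalizing user input once at the boundary (for example before saving it) is usually simpler than normalizing at every comparison, but which form to store is a choice that depends on the application.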

Also see

The Big List of Naughty Strings is an evolving list of strings which have a high probability of causing issues when used as user-input data.

Posted by Arne Hartherz to makandra dev (2014-06-03 11:46)