PSA: Umlauts are not always what they seem to be

Arne Hartherz
Software engineer at makandra GmbH
June 03, 2014

When a string contains umlauts that don't behave as expected (they are not matched by a regexp, can't be found with an SQL query, do not print correctly in LaTeX documents, etc.), you may be dealing with umlauts that are not actually single umlaut characters.

Depending on the font, they look just like their "real" umlaut counterparts:

  • ä ↔ ä
  • ö ↔ ö
  • ü ↔ ü

However, they are not the same:

>> 'ä' == 'ä'
=> false
>> 'ä'.size
=> 1
>> 'ä'.size
=> 2

Looking at how those strings are constructed reveals what is going on:

>> 'ä'.unpack('U*')
=> [228]
>> 'ä'.unpack('U*')
=> [97, 776]

As you can see, the second representation is actually a combination of "a" (code point 97) and a combining diaeresis (code point 776, U+0308), a zero-width Unicode character that puts the dots on top of the preceding letter.
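Because the two forms are visually identical, they can silently turn into one another (or get mixed up) when copied around. A minimal sketch that builds both from explicit code point escapes, so the difference is visible in the source itself; the variable names are only for illustration:

```ruby
# Build both forms from explicit escapes so nothing depends on how the
# editor or clipboard encodes the umlaut.
precomposed = "\u00E4"   # U+00E4, a single "a with diaeresis" character
decomposed  = "a\u0308"  # "a" followed by U+0308 COMBINING DIAERESIS

precomposed == decomposed   # => false
precomposed.unpack('U*')    # => [228]
decomposed.unpack('U*')     # => [97, 776]
precomposed.bytesize        # => 2 (bytes in UTF-8)
decomposed.bytesize         # => 3

# A search for the precomposed character misses the decomposed spelling.
word = "Ma\u0308dchen"
word.include?(precomposed)  # => false
```

The same mismatch is what makes regexps and SQL lookups miss such strings.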

This is terrible.

Ruby 2.2 is your friend

Ruby 2.2 brings String#unicode_normalize and its bang and query brothers (unicode_normalize! and unicode_normalized?). By default it normalizes to NFC, the composed form.

2.2.0 :001 > 'ä' == 'ä'
 => false 
2.2.0 :002 > 'ä' == 'ä'.unicode_normalize
 => true 
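In practice it helps to normalize both sides before comparing, since either string may be decomposed. A rough sketch, assuming Ruby 2.2 or later and UTF-8 strings; `same_text?` is a hypothetical helper name, not part of Ruby:

```ruby
# Hypothetical helper: compare two strings regardless of whether their
# umlauts are precomposed or built from combining characters.
def same_text?(a, b)
  a.unicode_normalize(:nfc) == b.unicode_normalize(:nfc)
end

precomposed = "\u00E4"   # single character
decomposed  = "a\u0308"  # "a" + combining diaeresis

same_text?(precomposed, decomposed)               # => true
decomposed.unicode_normalized?                    # => false (not in NFC)
decomposed.unicode_normalize.unpack('U*')         # => [228]
precomposed.unicode_normalize(:nfd).unpack('U*')  # => [97, 776]

# The bang variant normalizes in place.
copy = decomposed.dup
copy.unicode_normalize!
copy.size                                         # => 1
```

Normalizing user input once, before it is stored, avoids having to do this at every comparison.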

Also see

The Big List of Naughty Strings is an evolving list of strings which have a high probability of causing issues when used as user-input data.

Posted by Arne Hartherz to makandra dev (2014-06-03 11:46)