Posted over 4 years ago.

PSA: Umlauts are not always what they seem to be

When you have a string containing umlauts which don't behave as expected (they are not matched by a regexp, can't be found with an SQL query, do not print correctly in LaTeX documents, etc.), you may be encountering umlauts which are not actually single umlaut characters.

They look, depending on the font, like their "real" umlaut counterpart:

  • ä ↔ ä
  • ö ↔ ö
  • ü ↔ ü

However, they are not the same:

  >> 'ä' == 'ä'
  => false
  >> 'ä'.size
  => 1
  >> 'ä'.size
  => 2

Looking at how those strings are constructed reveals what is going on:

  >> 'ä'.unpack('U*')
  => [228]
  >> 'ä'.unpack('U*')
  => [97, 776]

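As you can see, the first representation is a single code point (228, U+00E4), while the second is actually a combination of "a" (code point 97) and a combining Unicode character (code point 776, U+0308 COMBINING DIAERESIS) which takes up no space of its own and puts the dots on top of the preceding letter.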
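Because both forms render identically, it can help to build them from explicit escapes when experimenting. The following is a minimal sketch; the helper name `codepoints_hex` is made up for illustration and is not part of Ruby:

```ruby
# Build both forms from explicit escapes so the source code stays unambiguous.
precomposed = "\u00E4"   # U+00E4 LATIN SMALL LETTER A WITH DIAERESIS
decomposed  = "a\u0308"  # "a" followed by U+0308 COMBINING DIAERESIS

# Hypothetical helper: list the code points of a string in U+XXXX notation.
def codepoints_hex(string)
  string.codepoints.map { |cp| format('U+%04X', cp) }
end

p codepoints_hex(precomposed)   # => ["U+00E4"]
p codepoints_hex(decomposed)    # => ["U+0061", "U+0308"]
p precomposed == decomposed     # => false
```

Inspecting code points like this is usually the quickest way to tell which form a suspicious string is in.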

This is terrible.

Ruby 2.2 is your friend

Ruby 2.2 brings String#unicode_normalize, along with its bang and query siblings String#unicode_normalize! and String#unicode_normalized?.

  2.2.0 :001 > 'ä' == 'ä'
   => false
  2.2.0 :002 > 'ä' == 'ä'.unicode_normalize
   => true
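One way to apply this is to normalize strings before comparing or matching them. The sketch below assumes Ruby 2.2 or later and UTF-8 strings; `same_text?` is a hypothetical helper name, not something Ruby provides:

```ruby
precomposed = "\u00E4"   # single code point "ä"
decomposed  = "a\u0308"  # "a" + combining diaeresis

# Hypothetical helper: compare two strings regardless of composition form.
# unicode_normalize defaults to :nfc, which is spelled out here for clarity.
def same_text?(a, b)
  a.unicode_normalize(:nfc) == b.unicode_normalize(:nfc)
end

p decomposed.unicode_normalized?              # => false
p precomposed.unicode_normalized?             # => true
p same_text?(precomposed, decomposed)         # => true

# A regexp written with the precomposed character only matches after normalizing.
p decomposed =~ /\u00E4/                      # => nil
p decomposed.unicode_normalize =~ /\u00E4/    # => 0

# The opposite direction (:nfd) splits the precomposed character apart.
p precomposed.unicode_normalize(:nfd).codepoints   # => [97, 776]
```

Normalizing user input once, before it is stored or compared, is probably simpler than normalizing at every comparison.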

Also see

The Big List of Naughty Strings is an evolving list of strings which have a high probability of causing issues when used as user-input data.

Owner of this card:

Arne Hartherz
Last edit:
about 3 years ago
by Henning Koch
Keywords:
unicode, mimicry, utf-8
About this deck:
We are makandra and do test-driven, agile Ruby on Rails software development.
Posted by Arne Hartherz to makandra dev