PSA: Umlauts are not always what they seem to be
When you have a string containing umlauts that don't behave as expected (they are not matched by a regexp, can't be found with an SQL query, do not print correctly in LaTeX documents, etc.), you may be encountering umlauts which are not actually single umlaut characters.
They look, depending on the font, like their "real" umlaut counterpart:
- ä ↔ ä
- ö ↔ ö
- ü ↔ ü
However, they are not the same:
>> 'ä' == 'ä'
=> false
>> 'ä'.size
=> 1
>> 'ä'.size
=> 2
Looking at how those strings are constructed reveals what is going on:
>> 'ä'.unpack('U*')
=> [228]
>> 'ä'.unpack('U*')
=> [97, 776]
As you can see, the second representation is actually a combination of "a" (code point 97) and a zero-width combining Unicode character (code point 776, U+0308 COMBINING DIAERESIS) which puts the dots on top of it.
This is terrible.
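Because the two forms look identical in most editors, it can help to write them with escape sequences when debugging. The following is a minimal sketch; the variable names are only illustrative, and the byte counts assume UTF-8 strings.

```ruby
# Escape sequences make the two forms distinguishable in source code.
precomposed = "\u00E4"   # U+00E4 LATIN SMALL LETTER A WITH DIAERESIS
decomposed  = "a\u0308"  # "a" followed by U+0308 COMBINING DIAERESIS

precomposed.codepoints   # => [228]
decomposed.codepoints    # => [97, 776]

# In UTF-8 the two forms also differ in byte length.
precomposed.bytes.size   # => 2
decomposed.bytes.size    # => 3

# A regexp written with the precomposed character misses the decomposed form.
precomposed =~ /\u00E4/  # => 0
decomposed  =~ /\u00E4/  # => nil
```

Inspecting `codepoints` on a suspicious string is usually the quickest way to tell which form you are dealing with.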
Ruby 2.2 is your friend
Ruby 2.2 brings String#unicode_normalize and its bang and query siblings, String#unicode_normalize! and String#unicode_normalized?.
2.2.0 :001 > 'ä' == 'ä'
=> false
2.2.0 :002 > 'ä' == 'ä'.unicode_normalize
=> true
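In practice you will often want to normalize both sides before comparing, since you may not know which form incoming data uses. Below is a small sketch assuming Ruby 2.2 or newer; `same_text?` is a hypothetical helper name, not part of Ruby. `unicode_normalize` defaults to the composed form (`:nfc`), and passing `:nfd` yields the decomposed form instead.

```ruby
# Hypothetical helper: compare two strings after normalizing both to NFC.
def same_text?(a, b)
  a.unicode_normalize(:nfc) == b.unicode_normalize(:nfc)
end

precomposed = "\u00E4"   # single code point
decomposed  = "a\u0308"  # "a" plus combining diaeresis

same_text?(precomposed, decomposed)             # => true

# The query method reports whether a string is already in the given form.
precomposed.unicode_normalized?                 # => true
decomposed.unicode_normalized?                  # => false

# Normalization converts between the two representations.
decomposed.unicode_normalize.codepoints         # => [228]
precomposed.unicode_normalize(:nfd).codepoints  # => [97, 776]
```

Normalizing user input once at the boundary (for example, before saving it) may be simpler than normalizing at every comparison.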
Also see
The Big List of Naughty Strings is an evolving list of strings which have a high probability of causing issues when used as user-input data.