Stripping all non-word characters (a.k.a. \W) while preserving diacritics (e.g. umlauts) in UTF-8

Sometimes you want to strip every special character from a text. If you use \W, the result might not be what you wanted, because umlauts and other non-ASCII letters are removed as well:

    irb> "Eine längliche Kristall §$&&%& Lampe über dem Esstisch verschönert jedes noch so kahle   \n  Speisezimmer".gsub(/\W+/, " ")
    => "Eine l ngliche Kristall Lampe ber dem Esstisch versch nert jedes noch so kahle Speisezimmer"
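
The reason is that Ruby's \w and \W only consider ASCII letters, digits and the underscore, even when the string is UTF-8. A quick check, as a minimal sketch (the variable names and one-character sample strings are made up for illustration):

```ruby
# \w covers only [a-zA-Z0-9_], so "ä" is treated as a non-word character.
ascii_word     = "a" =~ /\w/   # index of the first match, or nil
umlaut_word    = "ä" =~ /\w/
umlaut_nonword = "ä" =~ /\W/

p ascii_word      # 0
p umlaut_word     # nil
p umlaut_nonword  # 0
```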

You need to use "[^[:word:]]" instead, a tip given to us by the spirit of the community. In Ruby, the POSIX bracket expression [[:word:]] is Unicode-aware, so letters with diacritics count as word characters:

    irb> "Eine längliche Kristall §$&&%& Lampe über dem Esstisch verschönert jedes noch so kahle   \n  Speisezimmer".gsub(/[^[:word:]]+/, " ")
    => "Eine längliche Kristall Lampe über dem Esstisch verschönert jedes noch so kahle Speisezimmer"
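
If you need this in more than one place, the substitution can be wrapped in a small helper. The following is only a sketch: the method name `strip_non_word_chars` and the sample string are hypothetical and not part of the original card, and the extra `strip` (to drop a leading or trailing space) is an assumption about what you want.

```ruby
# Hypothetical helper: replaces every run of non-word characters with a
# single space and trims the ends. Unicode letters, digits and underscores
# are kept, because [[:word:]] is Unicode-aware in Ruby.
def strip_non_word_chars(text)
  text.gsub(/[^[:word:]]+/, " ").strip
end

puts strip_non_word_chars("  Crème brûlée & Käsespätzle!  ")
# prints "Crème brûlée Käsespätzle"
```

Note that digits and underscores also count as word characters, so they survive the cleanup.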

Keywords: i18n, umlaut

License
Source code in this card is licensed under the MIT License.
Posted by Lexy to makandra dev (2014-02-05 21:56)