On Ruby 1.9+, the standard character classes like \w and \d only match 7-bit ASCII characters:
"foo" =~ /\w+/ # matches "foo"
"füü" =~ /\w+/ # matches "f", since ü is not 7-bit ASCII
There is a collection of POSIX bracket expressions (character classes) that will match Unicode characters. From the Ruby documentation:
/[[:alnum:]]/ - Alphabetic and numeric character
/[[:alpha:]]/ - Alphabetic character
/[[:blank:]]/ - Space or tab
/[[:cntrl:]]/ - Control character
/[[:digit:]]/ - Digit
/[[:graph:]]/ - Non-blank character (excludes spaces, control characters, and similar)
/[[:lower:]]/ - Lowercase alphabetical character
/[[:print:]]/ - Like [[:graph:]], but includes the space character
/[[:punct:]]/ - Punctuation character
/[[:space:]]/ - Whitespace character ([[:blank:]], newline, carriage return, etc.)
/[[:upper:]]/ - Uppercase alphabetical
/[[:xdigit:]]/ - Digit allowed in a hexadecimal number (i.e., 0-9a-fA-F)
/[[:word:]]/ - A character in one of the following Unicode general categories: Letter, Mark, Number, Connector_Punctuation
/[[:ascii:]]/ - A character in the ASCII character set
So:
"füü" =~ /[[:alpha:]]+/ # matches "füü"
"१४" =~ /[[:digit:]]+/ # matches "१४"
Posted by Tobias Kraze to makandra dev (2015-08-28 08:20)