On Ruby 1.9+, the standard regexp character classes like \w and \d only match 7-bit ASCII characters:
"foo" =~ /\w+/   # matches "foo"
"füü" =~ /\w+/   # matches "f", ü is not 7-Bit ASCII
There is a collection of character classes that do match Unicode characters. From the documentation:
- /[[:alnum:]]/ - Alphabetic and numeric character
- /[[:alpha:]]/ - Alphabetic character
- /[[:blank:]]/ - Space or tab
- /[[:cntrl:]]/ - Control character
- /[[:digit:]]/ - Digit
- /[[:graph:]]/ - Non-blank character (excludes spaces, control characters, and similar)
- /[[:lower:]]/ - Lowercase alphabetical character
- /[[:print:]]/ - Like [[:graph:]], but includes the space character
- /[[:punct:]]/ - Punctuation character
- /[[:space:]]/ - Whitespace character ([[:blank:]], newline, carriage return, etc.)
- /[[:upper:]]/ - Uppercase alphabetical
- /[[:xdigit:]]/ - Digit allowed in a hexadecimal number (i.e., 0-9a-fA-F)
- /[[:word:]]/ - A character in one of the following Unicode general categories: Letter, Mark, Number, Connector_Punctuation
- /[[:ascii:]]/ - A character in the ASCII character set
So:
"füü" =~ /[[:alpha:]]+/   # matches "füü"
"१४" =~ /[[:digit:]]+/    # matches "१४"
Posted by Tobias Kraze to makandra dev (2015-08-28 08:20)