On Ruby 1.9+, standard Ruby character classes like \w and \d only match 7-bit ASCII characters:
"foo" =~ /\w+/   # matches "foo"
"füü" =~ /\w+/   # matches only "f", since "ü" is not 7-bit ASCII
There is a collection of character classes (POSIX bracket expressions) that will match Unicode characters. From the documentation:
- /[[:alnum:]]/ - Alphabetic and numeric character
- /[[:alpha:]]/ - Alphabetic character
- /[[:blank:]]/ - Space or tab
- /[[:cntrl:]]/ - Control character
- /[[:digit:]]/ - Digit
- /[[:graph:]]/ - Non-blank character (excludes spaces, control characters, and similar)
- /[[:lower:]]/ - Lowercase alphabetical character
- /[[:print:]]/ - Like [[:graph:]], but includes the space character
- /[[:punct:]]/ - Punctuation character
- /[[:space:]]/ - Whitespace character ([[:blank:]], newline, carriage return, etc.)
- /[[:upper:]]/ - Uppercase alphabetical
- /[[:xdigit:]]/ - Digit allowed in a hexadecimal number (i.e., 0-9a-fA-F)
- /[[:word:]]/ - A character in one of the following Unicode general categories: Letter, Mark, Number, Connector_Punctuation
- /[[:ascii:]]/ - A character in the ASCII character set
So:
"füü" =~ /[[:alpha:]]+/   # matches "füü"
"१४" =~ /[[:digit:]]+/    # matches "१४"