On Ruby 1.9+, the standard Ruby character classes such as \w and \d match only 7-bit ASCII characters:
"foo" =~ /\w+/ # matches "foo"
"füü" =~ /\w+/ # matches "f", ü is not 7-Bit ASCII
There is a collection of POSIX bracket expressions that also match Unicode characters. From the documentation:
/[[:alnum:]]/ - Alphabetic and numeric character
/[[:alpha:]]/ - Alphabetic character
/[[:blank:]]/ - Space or tab
/[[:cntrl:]]/ - Control character
/[[:digit:]]/ - Digit
/[[:graph:]]/ - Non-blank character (excludes spaces, control characters, and similar)
/[[:lower:]]/ - Lowercase alphabetical character
/[[:print:]]/ - Like [[:graph:]], but includes the space character
/[[:punct:]]/ - Punctuation character
/[[:space:]]/ - Whitespace character ([[:blank:]], newline, carriage return, etc.)
/[[:upper:]]/ - Uppercase alphabetical
/[[:xdigit:]]/ - Digit allowed in a hexadecimal number (i.e., 0-9a-fA-F)
/[[:word:]]/ - A character in one of the following Unicode general categories Letter, Mark, Number, Connector_Punctuation
/[[:ascii:]]/ - A character in the ASCII character set

So:
"füü" =~ /[[:alpha:]]+/ # matches "füü"
"१४" =~ /[[:digit:]]+/ # matches "१४"