On Ruby 1.9+, standard Ruby character classes like \w and \d will only match 7-bit ASCII characters:
"foo" =~ /\w+/ # matches "foo"
"füü" =~ /\w+/ # matches only "f", because ü is not 7-bit ASCII
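Note that =~ returns the index of the match, not the matched text. To see what a pattern actually consumed, String#[] with a regex returns the matched substring. A minimal sketch, assuming a UTF-8 source file (the default since Ruby 2.0); the variable names are made up for this example:

```ruby
# String#[] with a regex returns the matched substring (or nil if nothing matched).
ascii_word  = "foo"[/\w+/]  # the whole string is ASCII, so everything matches
umlaut_word = "füü"[/\w+/]  # matching stops at the first non-ASCII character
hindi_digit = "१४"[/\d+/]   # Devanagari digits are not ASCII digits

p ascii_word   # => "foo"
p umlaut_word  # => "f"
p hindi_digit  # => nil
```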
There is a collection of POSIX bracket character classes that will match Unicode characters. From the documentation:
- /[[:alnum:]]/ - Alphabetic and numeric character
- /[[:alpha:]]/ - Alphabetic character
- /[[:blank:]]/ - Space or tab
- /[[:cntrl:]]/ - Control character
- /[[:digit:]]/ - Digit
- /[[:graph:]]/ - Non-blank character (excludes spaces, control characters, and similar)
- /[[:lower:]]/ - Lowercase alphabetical character
- /[[:print:]]/ - Like [[:graph:]], but includes the space character
- /[[:punct:]]/ - Punctuation character
- /[[:space:]]/ - Whitespace character ([[:blank:]], newline, carriage return, etc.)
- /[[:upper:]]/ - Uppercase alphabetical
- /[[:xdigit:]]/ - Digit allowed in a hexadecimal number (i.e., 0-9a-fA-F)
- /[[:word:]]/ - A character in one of the following Unicode general categories Letter, Mark, Number, Connector_Punctuation
- /[[:ascii:]]/ - A character in the ASCII character set
So:
"füü" =~ /[[:alpha:]]+/ # matches "füü"
"१४" =~ /[[:digit:]]+/ # matches "१४"
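As an illustration of where the difference matters, the sketch below splits a UTF-8 string into words with both approaches. The helper names ascii_words and unicode_words are hypothetical and exist only for this example; it assumes the string is UTF-8 encoded.

```ruby
# Hypothetical helpers for illustration only.

# Splits on \w, which only knows 7-bit ASCII word characters.
def ascii_words(text)
  text.scan(/\w+/)
end

# Splits on [[:word:]], which also matches non-ASCII letters, marks and numbers.
def unicode_words(text)
  text.scan(/[[:word:]]+/)
end

text = "Füße über Ärger"
p ascii_words(text)    # => ["F", "e", "ber", "rger"]
p unicode_words(text)  # => ["Füße", "über", "Ärger"]
```

With \w the umlauts and ß act as word separators, so the words are cut into fragments; [[:word:]] keeps them intact.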
Posted by Tobias Kraze to makandra dev (2015-08-28 08:20)