On Ruby 1.9+, the standard character classes such as \w and \d match only 7-bit ASCII characters:

"foo" =~ /\w+/   # matches "foo"
"füü" =~ /\w+/   # matches only "f", because ü is not 7-bit ASCII

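Note that =~ returns the position of the match, not the matched text. To inspect what a pattern actually captured, String#[] with a regexp is handy. A minimal sketch, where first_match is a hypothetical helper name used only for illustration:

```ruby
# encoding: utf-8

# Returns the first substring of +string+ matched by +pattern+, or nil.
# (first_match is a hypothetical helper, not part of Ruby.)
def first_match(string, pattern)
  string[pattern]
end

p("füü" =~ /\w+/)             # 0 (index of the match, not the text)
p first_match("füü", /\w+/)   # "f" (\w stops at the first non-ASCII letter)
p first_match("१४", /\d+/)    # nil (\d does not match Devanagari digits)
```
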
POSIX bracket expressions, by contrast, also match non-ASCII (Unicode) characters. From the Ruby Regexp documentation:

/[[:alnum:]]/ Alphabetic and numeric character
/[[:alpha:]]/ Alphabetic character
/[[:blank:]]/ Space or tab
/[[:cntrl:]]/ Control character
/[[:digit:]]/ Digit
/[[:graph:]]/ Non-blank character (excludes spaces, control characters, and similar)
/[[:lower:]]/ Lowercase alphabetical character
/[[:print:]]/ Like [[:graph:]], but includes the space character
/[[:punct:]]/ Punctuation character
/[[:space:]]/ Whitespace character ([[:blank:]], newline, carriage return, etc.)
/[[:upper:]]/ Uppercase alphabetical
/[[:xdigit:]]/ Digit allowed in a hexadecimal number (i.e., 0-9a-fA-F)
/[[:word:]]/ A character in one of the following Unicode general categories Letter, Mark, Number, Connector_Punctuation
/[[:ascii:]]/ A character in the ASCII character set

So:

"füü" =~ /[[:alpha:]]+/   # matches "füü"
"१४" =~ /[[:digit:]]+/    # matches "१४"

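In practice these classes tend to show up in validations and text extraction. The following is a rough sketch under some assumptions: letters_only? and extract_words are hypothetical helper names, the strings are UTF-8 encoded, and String#match? requires Ruby 2.4 or newer (on older versions, =~ can be used instead).

```ruby
# encoding: utf-8

# Hypothetical helper: true if +string+ consists only of letters, in any script.
def letters_only?(string)
  string.match?(/\A[[:alpha:]]+\z/)
end

# Hypothetical helper: returns every run of letters found in +string+.
def extract_words(string)
  string.scan(/[[:alpha:]]+/)
end

p letters_only?("füü")              # true
p letters_only?("füü 42")           # false (space and digits are not letters)
p extract_words("Zoë Müller, 42")   # ["Zoë", "Müller"]
p "१४"[/[[:digit:]]+/]              # "१४"
```

Anchoring with \A and \z rather than ^ and $ makes the check apply to the whole string instead of a single line.
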
Matching unicode characters in a Ruby (1.9+) regexp