Matching Unicode characters in a Ruby (1.9+) regexp

Updated . Posted . Visible to the public. Repeats.

On Ruby 1.9+, standard ruby character classes like \w, \d will only match 7-Bit ASCII characters:

"foo" =~ /\w+/   # matches "foo"
"füü" =~ /\w+/   # matches "f", ü is not 7-Bit ASCII

There is a collection of character classes that will match unicode characters. From the documentation:

  • /[[:alnum:]]/ Alphabetic and numeric character
  • /[[:alpha:]]/ Alphabetic character
  • /[[:blank:]]/ Space or tab
  • /[[:cntrl:]]/ Control character
  • /[[:digit:]]/ Digit
  • /[[:graph:]]/ Non-blank character (excludes spaces, control characters, and similar)
  • /[[:lower:]]/ Lowercase alphabetical character
  • /[[:print:]]/ Like [[:graph:]], but includes the space character
  • /[[:punct:]]/ Punctuation character
  • /[[:space:]]/ Whitespace character ([[:blank:]], newline, carriage return, etc.)
  • /[[:upper:]]/ Uppercase alphabetical
  • /[[:xdigit:]]/ Digit allowed in a hexadecimal number (i.e., 0-9a-fA-F)
  • /[[:word:]]/ A character in one of the following Unicode general categories Letter, Mark, Number, Connector_Punctuation
  • /[[:ascii:]]/ A character in the ASCII character set

So:

"füü" =~ /[[:alpha:]]+/   # matches "füü"
"‎१४" =~ /[[:digit:]]+/    # matches "१४"

See also

Profile picture of Tobias Kraze
Tobias Kraze
Last edit
Michael Leimstädtner
License
Source code in this card is licensed under the MIT License.
Posted by Tobias Kraze to makandra dev (2015-08-28 08:20)