Matching Unicode characters in a Ruby (1.9+) regexp

Tobias Kraze
August 28, 2015
Software engineer at makandra GmbH

On Ruby 1.9+, the standard character classes such as \w and \d match only 7-bit ASCII characters:

"foo" =~ /\w+/   # matches "foo"
"füü" =~ /\w+/   # matches only "f", because ü is not 7-bit ASCII

Ruby also provides POSIX bracket expressions, a set of character classes that do match Unicode characters. From the documentation:

  • /[[:alnum:]]/ Alphabetic and numeric character
  • /[[:alpha:]]/ Alphabetic character
  • /[[:blank:]]/ Space or tab
  • /[[:cntrl:]]/ Control character
  • /[[:digit:]]/ Digit
  • /[[:graph:]]/ Non-blank character (excludes spaces, control characters, and similar)
  • /[[:lower:]]/ Lowercase alphabetical character
  • /[[:print:]]/ Like [[:graph:]], but includes the space character
  • /[[:punct:]]/ Punctuation character
  • /[[:space:]]/ Whitespace character ([[:blank:]], newline, carriage return, etc.)
  • /[[:upper:]]/ Uppercase alphabetical
  • /[[:xdigit:]]/ Digit allowed in a hexadecimal number (i.e., 0-9a-fA-F)
  • /[[:word:]]/ A character in one of the following Unicode general categories Letter, Mark, Number, Connector_Punctuation
  • /[[:ascii:]]/ A character in the ASCII character set
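
As a minimal sketch of how a few of the classes listed above behave on a UTF-8 string: the sample word and the variable names are made up for illustration, and the results assume the source file is read as UTF-8.

```ruby
# encoding: utf-8
# (The magic comment above only matters on Ruby 1.9; 2.0+ assumes UTF-8 source.)

word = "Füße"

ascii_word   = word[/\w+/]              # \w stops at the first non-ASCII letter
unicode_word = word[/[[:word:]]+/]      # [[:word:]] also accepts ü and ß

uppercase = word.scan(/[[:upper:]]/)    # uppercase letters only
lowercase = word.scan(/[[:lower:]]/)    # lowercase letters, including ü and ß
non_ascii = word.scan(/[^[:ascii:]]/)   # characters outside 7-bit ASCII

p ascii_word    # => "F"
p unicode_word  # => "Füße"
p uppercase     # => ["F"]
p lowercase     # => ["ü", "ß", "e"]
p non_ascii     # => ["ü", "ß"]
```

A bracket expression is itself used inside a character class, so it can be negated or combined with other characters, as in `[^[:ascii:]]` here.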

So:

"füü" =~ /[[:alpha:]]+/   # matches "füü"
"१४" =~ /[[:digit:]]+/    # matches "१४"
Posted by Tobias Kraze to makandra dev (2015-08-28 10:20)