Matching Unicode characters in a Ruby (1.9+) regexp

Tobias Kraze
August 28, 2015
Software engineer at makandra GmbH

On Ruby 1.9+, the standard character classes like \w and \d only match 7-bit ASCII characters:

"foo" =~ /\w+/   # matches "foo"
"füü" =~ /\w+/   # matches "f", ü is not 7-Bit ASCII
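
The same limitation can be checked for \d and \s. The following is a minimal sketch, assuming a UTF-8 source file; the variable names are only for illustration:

```ruby
# encoding: utf-8

# String#[] with a regexp returns the first match, or nil if there is none.
ascii_word   = "foo"[/\w+/]   # the whole string is ASCII
partial_word = "füü"[/\w+/]   # stops at the first non-ASCII letter
no_digits    = "१४"[/\d+/]    # Devanagari digits are not in 0-9

# \s does not cover the non-breaking space (U+00A0) either.
nbsp_index = "a\u00A0b" =~ /\s/

p ascii_word    # "foo"
p partial_word  # "f"
p no_digits     # nil
p nbsp_index    # nil
```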
Ruby also provides a set of POSIX bracket expressions that match Unicode characters. From the documentation:

  • /[[:alnum:]]/ Alphabetic and numeric character
  • /[[:alpha:]]/ Alphabetic character
  • /[[:blank:]]/ Space or tab
  • /[[:cntrl:]]/ Control character
  • /[[:digit:]]/ Digit
  • /[[:graph:]]/ Non-blank character (excludes spaces, control characters, and similar)
  • /[[:lower:]]/ Lowercase alphabetical character
  • /[[:print:]]/ Like [[:graph:]], but includes the space character
  • /[[:punct:]]/ Punctuation character
  • /[[:space:]]/ Whitespace character ([[:blank:]], newline, carriage return, etc.)
  • /[[:upper:]]/ Uppercase alphabetical character
  • /[[:xdigit:]]/ Digit allowed in a hexadecimal number (i.e., 0-9a-fA-F)
  • /[[:word:]]/ A character in one of the following Unicode general categories Letter, Mark, Number, Connector_Punctuation
  • /[[:ascii:]]/ A character in the ASCII character set
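
As a rough illustration of a few of these classes, the sketch below scans a mixed-script string. The sample text and variable names are made up for this example, and a UTF-8 source file is assumed:

```ruby
# encoding: utf-8

text = "Grüße, Мир! १४ items"

# String#scan collects every non-overlapping match into an array.
letters = text.scan(/[[:alpha:]]+/)  # runs of letters in any script
digits  = text.scan(/[[:digit:]]+/)  # runs of Unicode decimal digits
uppers  = text.scan(/[[:upper:]]/)   # single uppercase letters
words   = text.scan(/[[:word:]]+/)   # letters, marks, numbers, connectors

# Unlike \s, [[:space:]] also matches the non-breaking space (U+00A0).
nbsp_index = "a\u00A0b" =~ /[[:space:]]/

p letters     # ["Grüße", "Мир", "items"]
p digits      # ["१४"]
p uppers      # ["G", "М"]
p words       # ["Grüße", "Мир", "१४", "items"]
p nbsp_index  # 1
```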

So:

"füü" =~ /[[:alpha:]]+/   # matches "füü"
"१४" =~ /[[:digit:]]+/    # matches "१४"
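
Bracket expressions can also be combined or negated inside a single character class. This is a small sketch of that idea, assuming a UTF-8 source file; the strings and variable names are hypothetical:

```ruby
# encoding: utf-8

# Several bracket expressions plus a literal hyphen in one class.
token = "füü-१४ bar"[/[[:alpha:][:digit:]\-]+/]

# A leading ^ negates the class: remove everything that is not a letter.
letters_only = "füü bar, 42!".gsub(/[^[:alpha:]]/, "")

# Anchors make the pattern usable as a simple "letters only" check.
all_letters = !("füü" =~ /\A[[:alpha:]]+\z/).nil?

p token         # "füü-१४"
p letters_only  # "füübar"
p all_letters   # true
```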
Posted by Tobias Kraze to makandra dev (2015-08-28 10:20)