Read more

Matching unicode characters in a Ruby (1.9+) regexp

Tobias Kraze
August 28, 2015Software engineer at makandra GmbH

On Ruby 1.9+, standard ruby character classes like \w, \d will only match 7-Bit ASCII characters:

"foo" =~ /\w+/   # matches "foo"
"füü" =~ /\w+/   # matches "f", ü is not 7-Bit ASCII
Illustration money motivation

Opscomplete powered by makandra brand

Save money by migrating from AWS to our fully managed hosting in Germany.

  • Trusted by over 100 customers
  • Ready to use with Ruby, Node.js, PHP
  • Proactive management by operations experts
Read more Show archive.org snapshot

There is a collection of character classes that will match unicode characters. From the documentation:

  • /[[:alnum:]]/ Alphabetic and numeric character
  • /[[:alpha:]]/ Alphabetic character
  • /[[:blank:]]/ Space or tab
  • /[[:cntrl:]]/ Control character
  • /[[:digit:]]/ Digit
  • /[[:graph:]]/ Non-blank character (excludes spaces, control characters, and similar)
  • /[[:lower:]]/ Lowercase alphabetical character
  • /[[:print:]]/ Like [[:graph:]], but includes the space character
  • /[[:punct:]]/ Punctuation character
  • /[[:space:]]/ Whitespace character ([[:blank:]], newline, carriage return, etc.)
  • /[[:upper:]]/ Uppercase alphabetical
  • /[[:xdigit:]]/ Digit allowed in a hexadecimal number (i.e., 0-9a-fA-F)
  • /[[:word:]]/ A character in one of the following Unicode general categories Letter, Mark, Number, Connector_Punctuation
  • /[[:ascii:]]/ A character in the ASCII character set

So:

"füü" =~ /[[:alpha:]]+/   # matches "füü"
"‎१४" =~ /[[:digit:]]+/    # matches "१४"
Posted by Tobias Kraze to makandra dev (2015-08-28 10:20)