Posted about 4 years ago.

Matching Unicode characters in a Ruby (1.9+) regexp

On Ruby 1.9+, the standard Ruby character classes such as \w and \d match only 7-bit ASCII characters:

"foo" =~ /\w+/ # matches "foo"
"füü" =~ /\w+/ # matches only "f", because ü is not 7-bit ASCII

Ruby also supports POSIX bracket expressions, a collection of character classes that do match Unicode characters. From the documentation:

  • /[[:alnum:]]/ Alphabetic and numeric character
  • /[[:alpha:]]/ Alphabetic character
  • /[[:blank:]]/ Space or tab
  • /[[:cntrl:]]/ Control character
  • /[[:digit:]]/ Digit
  • /[[:graph:]]/ Non-blank character (excludes spaces, control characters, and similar)
  • /[[:lower:]]/ Lowercase alphabetical character
  • /[[:print:]]/ Like [[:graph:]], but includes the space character
  • /[[:punct:]]/ Punctuation character
  • /[[:space:]]/ Whitespace character ([[:blank:]], newline, carriage return, etc.)
  • /[[:upper:]]/ Uppercase alphabetical
  • /[[:xdigit:]]/ Digit allowed in a hexadecimal number (i.e., 0-9a-fA-F)
  • /[[:word:]]/ A character in one of the following Unicode general categories Letter, Mark, Number, Connector_Punctuation
  • /[[:ascii:]]/ A character in the ASCII character set
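
As a rough illustration of the list above, the following sketch runs a few of these classes against non-ASCII input. The sample string and variable names are made up for this example, and a UTF-8 source file is assumed:

```ruby
# Sample text with an umlaut, a sharp s, Devanagari digits and a
# non-breaking space (U+00A0) in front of "cm".
text = "Größe: ४२\u00A0cm"

# Runs of letters, marks, numbers and connector punctuation (such as "_").
words = text.scan(/[[:word:]]+/)          # => ["Größe", "४२", "cm"]

# First run of digits from any script.
digits = text[/[[:digit:]]+/]             # => "४२"

# Uppercase letters only.
upper = text.scan(/[[:upper:]]/)          # => ["G"]

# A bracket class can be negated: everything outside ASCII.
non_ascii = text.scan(/[^[:ascii:]]/)     # 5 characters: ö, ß, ४, २, U+00A0

# [[:space:]] also covers the non-breaking space, which \s does not.
squeezed = text.gsub(/[[:space:]]+/, " ") # => "Größe: ४२ cm"

puts words.join(" | ")
puts squeezed
```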

So:

"füü" =~ /[[:alpha:]]+/ # matches "füü"
"१४" =~ /[[:digit:]]+/ # matches "१४"

Owner of this card: Tobias Kraze
Last edit: over 2 years ago by Arne Hartherz
Posted by Tobias Kraze to makandra dev