Posted almost 2 years ago.

Matching Unicode characters in a Ruby (1.9+) regexp

On Ruby 1.9+, the standard character classes such as \w and \d match only 7-bit ASCII characters:

"foo" =~ /\w+/ # matches "foo"
"füü" =~ /\w+/ # matches only "f", because ü is not 7-bit ASCII

POSIX bracket expressions, by contrast, are character classes that also match non-ASCII Unicode characters. From the Ruby documentation:

  • /[[:alnum:]]/ Alphabetic and numeric character
  • /[[:alpha:]]/ Alphabetic character
  • /[[:blank:]]/ Space or tab
  • /[[:cntrl:]]/ Control character
  • /[[:digit:]]/ Digit
  • /[[:graph:]]/ Non-blank character (excludes spaces, control characters, and similar)
  • /[[:lower:]]/ Lowercase alphabetical character
  • /[[:print:]]/ Like [[:graph:]], but includes the space character
  • /[[:punct:]]/ Punctuation character
  • /[[:space:]]/ Whitespace character ([[:blank:]], newline, carriage return, etc.)
  • /[[:upper:]]/ Uppercase alphabetical character
  • /[[:xdigit:]]/ Digit allowed in a hexadecimal number (i.e., 0-9a-fA-F)
  • /[[:word:]]/ A character in one of the following Unicode general categories: Letter, Mark, Number, Connector_Punctuation
  • /[[:ascii:]]/ A character in the ASCII character set
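
A few of the classes listed above can be tried out with a minimal sketch like the following. The sample strings and variable names are made up for illustration, and the results assume UTF-8 encoded strings (the default for source files on Ruby 2.0+):

```ruby
# Illustrative inputs only; all strings are assumed to be UTF-8.
upper = "Ärger"[/[[:upper:]]+/]                 # => "Ä"
lower = "Ärger"[/[[:lower:]]+/]                 # => "rger"

# [[:word:]] covers letters, marks, numbers and the underscore.
words = "naïve_café42 ok".scan(/[[:word:]]+/)   # => ["naïve_café42", "ok"]

# [[:space:]] also matches the no-break space (U+00A0), which \s does not.
nbsp_posix = "a\u00A0b" =~ /[[:space:]]/        # => 1
nbsp_plain = "a\u00A0b" =~ /\s/                 # => nil

# [[:ascii:]] is the one class here that stays ASCII-only by definition.
ascii = "é" =~ /[[:ascii:]]/                    # => nil
```

String#[] with a regexp returns the first matched substring, while =~ returns the index of the match or nil.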

So:

"füü" =~ /[[:alpha:]]+/ # matches "füü"
"१४" =~ /[[:digit:]]+/ # matches "१४"
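
The difference becomes more visible when splitting text into words. The following is a small sketch with a hypothetical sample string, not taken from the original card, comparing String#scan with \w and with [[:alpha:]]:

```ruby
text = "Grüße aus Köln" # hypothetical sample input, UTF-8

# \w breaks each word at every non-ASCII letter.
ascii_words = text.scan(/\w+/)
# => ["Gr", "e", "aus", "K", "ln"]

# [[:alpha:]] treats ü, ß and ö as letters and keeps the words whole.
unicode_words = text.scan(/[[:alpha:]]+/)
# => ["Grüße", "aus", "Köln"]
```

If digits and underscores should count as part of a word too, [[:word:]] is probably the closer Unicode-aware replacement for \w.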

Author of this card: Tobias Kraze
Last edit: 6 months ago by Arne Hartherz
Posted by Tobias Kraze to makandropedia