Posted over 4 years ago.

Matching Unicode characters in a Ruby (1.9+) regexp

On Ruby 1.9+, standard character classes like \w and \d only match 7-bit ASCII characters:

"foo" =~ /\w+/ # matches "foo"
"füü" =~ /\w+/ # matches "f", ü is not 7-bit ASCII
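
The same ASCII-only behavior can be checked for \d as well. The following is a minimal sketch, assuming a UTF-8 source file (the default since Ruby 2.0); the variable names are only illustrative. It uses String#[] with a regexp, which returns the matched substring or nil.

```ruby
# \w and \d are ASCII-only, even when the string itself is UTF-8.
ascii_word   = "foo"[/\w+/]  # every character is ASCII, so the whole string matches
unicode_word = "füü"[/\w+/]  # the match stops before the first "ü"
devanagari   = "१४"[/\d+/]   # Devanagari digits are not ASCII digits, so there is no match

p ascii_word    # => "foo"
p unicode_word  # => "f"
p devanagari    # => nil
```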

There is a collection of character classes (POSIX bracket expressions) that will match Unicode characters. From the documentation:

  • /[[:alnum:]]/ Alphabetic and numeric character
  • /[[:alpha:]]/ Alphabetic character
  • /[[:blank:]]/ Space or tab
  • /[[:cntrl:]]/ Control character
  • /[[:digit:]]/ Digit
  • /[[:graph:]]/ Non-blank character (excludes spaces, control characters, and similar)
  • /[[:lower:]]/ Lowercase alphabetical character
  • /[[:print:]]/ Like [[:graph:]], but includes the space character
  • /[[:punct:]]/ Punctuation character
  • /[[:space:]]/ Whitespace character ([[:blank:]], newline, carriage return, etc.)
  • /[[:upper:]]/ Uppercase alphabetical character
  • /[[:xdigit:]]/ Digit allowed in a hexadecimal number (i.e., 0-9a-fA-F)
  • /[[:word:]]/ A character in one of the following Unicode general categories: Letter, Mark, Number, Connector_Punctuation
  • /[[:ascii:]]/ A character in the ASCII character set

So:

"füü" =~ /[[:alpha:]]+/ # matches "füü"
"१४" =~ /[[:digit:]]+/ # matches "१४"
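
To see the difference on a longer string, the classes above can be compared side by side with String#scan. This is only a sketch: `letters_only?` is a hypothetical helper that is not part of the original card, and String#match? assumes Ruby 2.4 or newer.

```ruby
# Hypothetical helper (not from the card): true if the string consists
# only of letters, in any script.
def letters_only?(string)
  string.match?(/\A[[:alpha:]]+\z/)
end

text = "füü bär 42"

ascii_runs  = text.scan(/\w+/)          # ASCII-only, so words break at every umlaut
word_runs   = text.scan(/[[:word:]]+/)  # Unicode-aware, keeps the words intact
letter_runs = text.scan(/[[:alpha:]]+/) # letters only, so "42" is skipped

p ascii_runs   # => ["f", "b", "r", "42"]
p word_runs    # => ["füü", "bär", "42"]
p letter_runs  # => ["füü", "bär"]

p letters_only?("füü")    # => true
p letters_only?("füü42")  # => false
```

Anchoring with \A and \z makes the helper check the whole string rather than just finding a matching substring.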

Owner of this card: Tobias Kraze
Last edit: about 3 years ago by Arne Hartherz
Posted by Tobias Kraze to makandra dev