Posted 7 months ago.

Regular Expressions: Space Separators

Matching the "space" character class

To match whitespace in a regular expression, the best-known shorthand is probably \s.
It matches the following whitespace characters:

  • " " (space)
  • \n (newline)
  • \r (carriage return)
  • \t (tab)
  • \f (form feed/page break)
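
As a quick check of the list above, the following sketch tests each character against \s. The constant ASCII_WHITESPACE and the method matched_by_backslash_s are names made up for this illustration, not part of Ruby.

```ruby
# Hypothetical lookup table: the characters the list above names, with labels.
ASCII_WHITESPACE = {
  "space"           => " ",
  "newline"         => "\n",
  "carriage return" => "\r",
  "tab"             => "\t",
  "form feed"       => "\f"
}.freeze

# Returns the labels of all listed characters that \s matches.
def matched_by_backslash_s
  ASCII_WHITESPACE.select { |_label, char| char.match?(/\s/) }.keys
end

puts matched_by_backslash_s.inspect
```

Every label from the table should be printed; a non-whitespace character such as "a" would be filtered out.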

However, in some cases this is not enough.

Non-breaking spaces (nbsp)

Sometimes a text contains two words separated by a space, but the author wants to ensure that both words stay on the same line. In such cases a non-breaking space character (U+00A0) is used. How a non-breaking space is represented depends on the encoding. In a UTF-8 string, it can be written as \u00A0 (the code point) or \xC2\xA0 (its two bytes). \s only matches the ASCII whitespace characters listed above, so it does not match non-breaking spaces. However, there are two options you can use.
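
A minimal sketch of the point above, assuming a UTF-8 source file (the Ruby default). The constant NBSP is a name chosen for this example only.

```ruby
# U+00A0 NO-BREAK SPACE, written as a code point escape.
NBSP = "\u00A0"

# In UTF-8 the character is stored as the two bytes 0xC2 0xA0.
puts NBSP.bytes.map { |byte| format("%02X", byte) }.join(" ")

# Both spellings produce the same UTF-8 string.
puts NBSP == "\xC2\xA0"

# \s does not match the non-breaking space.
puts "10#{NBSP}km".match?(/\s/)
```

The last line prints false even though the string visibly contains a space between "10" and "km".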

If you know your text is valid UTF-8, you can use the POSIX character classes. The POSIX bracket expression for whitespace is /[[:space:]]/. This is the recommended way, as it matches non-breaking spaces encoded in UTF-8 as well as all whitespace characters matched by \s.

"\xC2\xA0".match?(/[[:space:]]/) # => true
"\u00A0".match?(/[[:space:]]/) # => true
" ".match?(/[[:space:]]/) # => true
"\n".match?(/[[:space:]]/) # => true

nbsp = "\xC2\xA0".encode!(Encoding::ISO_8859_1) # => "\xA0"
nbsp.match?(/[[:space:]]/) # => Encoding::CompatibilityError (incompatible encoding regexp match (UTF-8 regexp with ISO-8859-1 string))
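
One place where the difference matters in practice is splitting text into words. The helper split_words below is a hypothetical name for this sketch; it assumes the input is valid UTF-8.

```ruby
# Splits on runs of any whitespace, including non-breaking spaces.
def split_words(text)
  text.split(/[[:space:]]+/)
end

puts split_words("10\u00A0km per hour").inspect # => ["10", "km", "per", "hour"]
```

Splitting the same string with /\s+/ would leave "10" and "km" glued together, because the non-breaking space between them is not matched.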

Unicode character class for space separator

The Unicode character classes also only work for UTF-8. The shorthand expression for the space separator character class is \p{Zs}. This matches non-breaking spaces encoded in UTF-8. Be aware, however, that it does not match [\n\r\t\f].

"\xC2\xA0".match?(/\p{Zs}/) # => true
"\u00A0".match?(/\p{Zs}/) # => true
" ".match?(/\p{Zs}/) # => true
"\n".match?(/\p{Zs}/) # => false

nbsp = "\xC2\xA0".encode!(Encoding::ISO_8859_1) # => "\xA0"
nbsp.match?(/\p{Zs}/) # => Encoding::CompatibilityError (incompatible encoding regexp match (UTF-8 regexp with ISO-8859-1 string))
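
To compare the three expressions side by side, a small sketch like the following may help. The method space_class_matches and its hash keys are invented for this illustration; the input is assumed to be a UTF-8 string.

```ruby
# Reports which of the three expressions match the given string.
def space_class_matches(char)
  {
    backslash_s: char.match?(/\s/),
    posix_space: char.match?(/[[:space:]]/),
    unicode_zs:  char.match?(/\p{Zs}/)
  }
end

{ "space" => " ", "nbsp" => "\u00A0", "tab" => "\t" }.each do |label, char|
  puts "#{label}: #{space_class_matches(char)}"
end
```

A plain space should be matched by all three, a non-breaking space by everything except \s, and a tab by everything except \p{Zs}.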

Convert Encoding to UTF-8

If you want to use the Unicode or POSIX space separator character class but cannot be sure that a text is encoded in UTF-8, this card may help you.
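
As a rough sketch of the idea (not the content of the referenced card), a string can be transcoded with String#encode before matching. This assumes the string is tagged with its actual encoding, here ISO-8859-1; the method name contains_space_separator? is hypothetical.

```ruby
# Transcodes to UTF-8 first, so the UTF-8 regexp can be applied safely.
# Assumption: text.encoding correctly describes the bytes in text.
def contains_space_separator?(text)
  text.encode(Encoding::UTF_8).match?(/\p{Zs}/)
end

latin1 = "10\u00A0km".encode(Encoding::ISO_8859_1) # nbsp is the single byte 0xA0 here
puts contains_space_separator?(latin1)
```

If the string's encoding tag is wrong or the bytes are invalid, String#encode can raise, so real input may need additional handling.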

Owner of this card:

Bruno Sedler
Last edit:
6 months ago
by Bruno Sedler
Posted by Bruno Sedler to makandra dev