Posted about 1 year ago.

Regular Expressions: Space Separators

Matching the "space" character class

To match whitespace in a regular expression, the best-known shorthand is probably \s.
It matches the following whitespace characters:

  • " " (space)
  • \n (newline)
  • \r (carriage return)
  • \t (tab)
  • \f (form feed/page break)
  • \v (vertical tab, in current Ruby versions)
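
As a quick check of this list, the following sketch tests the space, newline, carriage return, tab and form feed characters against \s. The names whitespace_samples and results are only illustrative.

```ruby
# Each sample character is matched against the \s shorthand.
whitespace_samples = {
  "space"           => " ",
  "newline"         => "\n",
  "carriage return" => "\r",
  "tab"             => "\t",
  "form feed"       => "\f"
}

# Build a hash of name => true/false.
results = whitespace_samples.transform_values { |char| char.match?(/\s/) }

results.each { |name, matched| puts "#{name}: #{matched}" }
```

Every entry should print true.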

However, \s does not cover every kind of whitespace you may encounter.

Non-breaking spaces (nbsp)

Sometimes an author separates two words with a space but wants them to stay on the same line. In such cases a non-breaking space character (U+00A0) is used. How a non-breaking space is stored depends on the encoding. In a UTF-8 Ruby string it can be written as "\u00A0" or as the bytes "\xC2\xA0". In Ruby, \s only matches ASCII whitespace characters, so it does not match non-breaking spaces. However, there are two options you can use.
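
A minimal sketch of the problem, assuming a UTF-8 source file (the Ruby default). The variable names are only illustrative.

```ruby
nbsp = "\u00A0" # non-breaking space, U+00A0

# In UTF-8 the character is stored as the two bytes C2 A0.
hex_bytes = nbsp.bytes.map { |byte| format("%02X", byte) }
puts hex_bytes.inspect          # => ["C2", "A0"]

# Both literal spellings produce the same string.
same_string = (nbsp == "\xC2\xA0")
puts same_string                # => true

# \s does not see the nbsp as whitespace ...
matched_by_s = nbsp.match?(/\s/)
puts matched_by_s               # => false

# ... so splitting on \s leaves the text in one piece.
parts = "10\u00A0km".split(/\s/)
puts parts.length               # => 1
```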

If you know your text is valid UTF-8, you can use the POSIX character classes. The expression for the whitespace character class is /[[:space:]]/. This is the recommended way, as it matches nbsps encoded in UTF-8 as well as all whitespace characters matched by \s.

"\xC2\xA0".match?(/[[:space:]]/) # => true
"\u00A0".match?(/[[:space:]]/) # => true
" ".match?(/[[:space:]]/) # => true
"\n".match?(/[[:space:]]/) # => true

nbsp = "\xC2\xA0".encode!(Encoding::ISO_8859_1) # => "\xA0"
nbsp.match?(/[[:space:]]/) # => Encoding::CompatibilityError (incompatible encoding regexp match (UTF-8 regexp with ISO-8859-1 string))

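One place where /[[:space:]]/ is useful is collapsing mixed whitespace in user input. The following is only a sketch: normalize_whitespace is a hypothetical helper name, and it assumes the input is valid UTF-8.

```ruby
# Hypothetical helper: collapses every run of whitespace (including
# non-breaking spaces) into a single regular space and trims the ends.
def normalize_whitespace(text)
  text.gsub(/[[:space:]]+/, " ").strip
end

messy = "\u00A0 Hello\u00A0\n world\t"
clean = normalize_whitespace(messy)

puts clean.inspect # => "Hello world"
```

The gsub call runs before strip on purpose: once every run of whitespace has been turned into a plain space, strip only has regular spaces left to remove.
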
Unicode character class for space separator

The Unicode character classes also only work for UTF-8. The shorthand expression for the space separator character class is \p{Zs}. This matches nbsps encoded in UTF-8. But be aware that it does not match [\n\r\t\f], because those are control characters and not space separators.

"\xC2\xA0".match?(/\p{Zs}/) # => true
"\u00A0".match?(/\p{Zs}/) # => true
" ".match?(/\p{Zs}/) # => true
"\n".match?(/\p{Zs}/) # => false

nbsp = "\xC2\xA0".encode!(Encoding::ISO_8859_1) # => "\xA0"
nbsp.match?(/\p{Zs}/) # => Encoding::CompatibilityError (incompatible encoding regexp match (UTF-8 regexp with ISO-8859-1 string))

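To compare the three expressions side by side, this sketch builds a small table of which sample characters each one matches. The names samples, patterns and matrix are only illustrative, and all strings are assumed to be UTF-8.

```ruby
samples = {
  "space"   => " ",
  "nbsp"    => "\u00A0",
  "newline" => "\n",
  "tab"     => "\t"
}

patterns = {
  '\s'          => /\s/,
  '[[:space:]]' => /[[:space:]]/,
  '\p{Zs}'      => /\p{Zs}/
}

# For every sample, record whether each pattern matches it.
matrix = samples.transform_values do |char|
  patterns.transform_values { |regex| char.match?(regex) }
end

matrix.each { |name, row| puts "#{name}: #{row}" }
```

Only [[:space:]] should report true for all four samples. \s misses the nbsp, and \p{Zs} misses the newline and the tab.
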
Convert Encoding to UTF-8

If you want to use the Unicode or POSIX space separator character class but cannot be sure that a text is encoded in UTF-8, this card may help you.
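
Independently of the referenced card, here is a minimal sketch of one possible approach. It assumes you already know the real encoding of the text (ISO-8859-1 in this example) and transcodes with String#encode before matching. The variable names are only illustrative.

```ruby
# A single nbsp byte as it would arrive in ISO-8859-1 encoded text.
latin1_nbsp = [0xA0].pack("C").force_encoding(Encoding::ISO_8859_1)

# Transcode to UTF-8. This is only correct if the source encoding
# label (ISO-8859-1 here) really describes the bytes.
utf8_nbsp = latin1_nbsp.encode(Encoding::UTF_8)

puts utf8_nbsp.encoding                  # => UTF-8
puts utf8_nbsp == "\u00A0"               # => true
puts utf8_nbsp.match?(/[[:space:]]/)     # => true
puts utf8_nbsp.match?(/\p{Zs}/)          # => true
```

If the encoding of the incoming text is unknown, this sketch is not sufficient, because String#encode needs a correct source encoding to produce the right characters.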

Posted by Bruno Sedler to makandra dev