Regular Expressions: Space Separators


Matching the "space" character class

To match whitespace in a regular expression, the best-known shorthand is probably \s.
It matches the following whitespace characters:

  • " " (space)
  • \n (newline)
  • \r (carriage return)
  • \t (tab)
  • \f (form feed/page break)
  • \v (vertical tab)

However, in some cases these may not be good enough for your purpose.
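As a quick check of the list above, the following minimal sketch tests a few of the listed characters against \s. It assumes Ruby, the language of the other snippets in this card; the constant name LISTED_WHITESPACE is made up for illustration.

```ruby
# Characters from the list above, written as Ruby string literals.
LISTED_WHITESPACE = [" ", "\n", "\r", "\t", "\f"].freeze

# Every one of them is matched by the \s shorthand.
all_match = LISTED_WHITESPACE.all? { |char| char.match?(/\s/) }
puts all_match

# An ordinary letter is not whitespace.
letter_matches = "a".match?(/\s/)
puts letter_matches
```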

Non-breaking spaces (nbsp)

Sometimes two words are separated by a space, but the author wants to keep them on the same line. In such cases a non-breaking space (nbsp) is used. Its Unicode code point is U+00A0, and how it is stored depends on the encoding: in UTF-8 it is the byte sequence C2 A0, so in a Ruby string literal you can write it as "\u00A0" or "\xC2\xA0". In Ruby, \s only covers ASCII whitespace, so it does not match non-breaking spaces. However, there are two options you can use.
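The following sketch illustrates the point in Ruby: it inspects how a non-breaking space is stored in UTF-8 and shows that \s does not see it. The sample text and variable names are invented for this example.

```ruby
nbsp = "\u00A0"

# In UTF-8, U+00A0 is stored as the two bytes 0xC2 0xA0.
utf8_bytes = nbsp.bytes
p utf8_bytes # => [194, 160]

# \s does not match the non-breaking space.
matched_by_s = nbsp.match?(/\s/)
p matched_by_s # => false

# Splitting on \s therefore leaves words joined by a non-breaking space together.
words = "10\u00A0km away".split(/\s+/)
p words.length # => 2
```

This is why a plain split(/\s+/) or gsub(/\s+/, " ") can silently skip such characters.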

If you know your text is valid UTF-8, you can use a POSIX bracket expression. The one for whitespace is [[:space:]], written as /[[:space:]]/ in a regular expression. This is the recommended way, as it matches non-breaking spaces encoded in UTF-8 as well as all whitespace characters matched by \s.

"\u00A0".match?(/[[:space:]]/)
# => true

"\xC2\xA0".match?(/[[:space:]]/)
# => true

" ".match?(/[[:space:]]/)
# => true

"\n".match?(/[[:space:]]/)
# => true
nbsp = "\xC2\xA0".encode!(Encoding::ISO_8859_1)
# => "\xA0"
# => Encoding::CompatibilityError (incompatible encoding regexp match (UTF-8 regexp with ISO-8859-1 string))

Unicode character class for space separator

As an alternative to the POSIX bracket expression, you may also use a Unicode character class. The shorthand for the space separator category is \p{Zs}. It matches all kinds of spaces that take up space, but not tabs, line breaks or form feeds ([\n\r\t\f]) and not zero-width spaces:

"\u00A0".match?(/\p{Zs}/)
# => true

"\xC2\xA0".match?(/\p{Zs}/)
# => true

" ".match?(/\p{Zs}/)
# => true

"\n".match?(/\p{Zs}/)
# => false

nbsp = "\xC2\xA0".encode!(Encoding::ISO_8859_1)
# => "\xA0"
nbsp.match?(/\p{Zs}/)
# => Encoding::CompatibilityError (incompatible encoding regexp match (UTF-8 regexp with ISO-8859-1 string))
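To compare the three options side by side, here is a small sketch that runs a handful of UTF-8 characters through \s, [[:space:]] and \p{Zs}. The sample set and the names SAMPLES and comparison are chosen for illustration only and are not part of the original card.

```ruby
# A few UTF-8 characters to compare (hypothetical sample set).
SAMPLES = {
  "space"            => " ",
  "tab"              => "\t",
  "newline"          => "\n",
  "nbsp"             => "\u00A0",
  "em space"         => "\u2003",
  "zero-width space" => "\u200B"
}.freeze

# For each character, record which of the three patterns matches it.
comparison = SAMPLES.transform_values do |char|
  {
    s:     char.match?(/\s/),
    posix: char.match?(/[[:space:]]/),
    zs:    char.match?(/\p{Zs}/)
  }
end

comparison.each do |name, result|
  puts format("%-18s \\s=%-5s [[:space:]]=%-5s \\p{Zs}=%s",
              name, result[:s], result[:posix], result[:zs])
end
```

In this sample, only [[:space:]] covers both the ASCII whitespace characters and the wider Unicode spaces, and none of the three patterns matches the zero-width space.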

Convert Encoding to UTF-8

If you want to use the Unicode or POSIX space separator character class but cannot be sure that a text is encoded in UTF-8, this card may help you.
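As a minimal sketch of the idea, and assuming you already know the source encoding (ISO-8859-1 here, chosen only for illustration), String#encode transcodes the text to UTF-8 so that the Unicode character class can be applied. Text whose encoding is unknown or mislabeled needs more care than this shows.

```ruby
# A non-breaking space as it might arrive from an ISO-8859-1 source.
latin1_nbsp = "\u00A0".encode(Encoding::ISO_8859_1)

# Transcode to UTF-8 before matching with the Unicode character class.
utf8_nbsp = latin1_nbsp.encode(Encoding::UTF_8)

puts utf8_nbsp.encoding          # => UTF-8
puts utf8_nbsp.match?(/\p{Zs}/)  # => true
```

Note that String#encode converts the bytes, whereas String#force_encoding only relabels them; the latter is appropriate only when the bytes are already in the target encoding.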

Bruno Sedler
Last edit: Henning Koch
Source code in this card is licensed under the MIT License.
Posted by Bruno Sedler to makandra dev (2021-05-12 07:07)