Regular Expressions: Space Separators

Matching the "space" character class

For matching whitespace in a regular expression, the best-known shorthand is probably \s.
In Ruby, it matches the following whitespace characters:

  • " " (space)
  • \n (newline)
  • \r (carriage return)
  • \t (tab)
  • \f (form feed/page break)
  • \v (vertical tab)
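
As a quick check, the space, newline, carriage return, tab and form feed characters can be tested against \s in one go. This is only a small sketch; the names samples and results are made up for illustration.

```ruby
# Characters that \s is expected to match, keyed by a readable name.
samples = {
  "space"           => " ",
  "newline"         => "\n",
  "carriage return" => "\r",
  "tab"             => "\t",
  "form feed"       => "\f"
}

# Map each name to true/false depending on whether \s matches the character.
results = samples.transform_values { |char| char.match?(/\s/) }

results.each { |name, matched| puts "#{name}: #{matched}" }
```

Every entry should print true.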

However, in some cases these may not be good enough for your purpose.

Non-breaking spaces (nbsp)

Sometimes a text contains two words separated by a space, but the author wanted to ensure that those words stay on the same line. In such cases a non-breaking space character (U+00A0) is used. Its byte representation depends on the encoding. In UTF-8 it is the two bytes C2 A0, so in a Ruby string literal it can be written as "\u00A0" or "\xC2\xA0". By default, Ruby's \s only covers the ASCII whitespace characters listed above, so it does not match non-breaking spaces. However, there are two options you can use.
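
The following sketch illustrates the problem. It assumes a UTF-8 source file (Ruby's default); the variable names are only for illustration.

```ruby
nbsp = "\u00A0"

bytes        = nbsp.bytes               # => [194, 160], i.e. 0xC2 0xA0
same_string  = (nbsp == "\xC2\xA0")     # => true, both literals describe the same UTF-8 string
matched_by_s = nbsp.match?(/\s/)        # => false

# Splitting on \s leaves words joined by a non-breaking space together.
parts = "two\u00A0words".split(/\s/)

puts bytes.inspect
puts same_string
puts matched_by_s
puts parts.length                       # prints 1
```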

If you know your text is valid UTF-8, you can use a POSIX bracket expression. The expression for whitespace is [[:space:]], written /[[:space:]]/ as a Ruby regular expression. This is the recommended way, as it matches non-breaking spaces encoded in UTF-8 and also all whitespace characters matched by \s.

"\xC2\xA0".match?(/[[:space:]]/)
# => true

"\u00A0".match?(/[[:space:]]/)
# => true

" ".match?(/[[:space:]]/)
# => true

"\n".match?(/[[:space:]]/)
# => true

nbsp = "\xC2\xA0".encode!(Encoding::ISO_8859_1)
# => "\xA0"
nbsp.match?(/[[:space:]]/) 
# => Encoding::CompatibilityError (incompatible encoding regexp match (UTF-8 regexp with ISO-8859-1 string))
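
A typical use is cleaning up user input that may contain non-breaking spaces. The sketch below assumes a UTF-8 string; the sample text and variable names are invented for illustration.

```ruby
text = "\u00A0 Hello\u00A0\u00A0world \n"

# String#strip only removes ASCII whitespace (and null bytes),
# so the leading non-breaking space survives.
stripped = text.strip

# Collapse every run of whitespace, including non-breaking spaces,
# into a single regular space, then trim the ends.
normalized = text.gsub(/[[:space:]]+/, " ").strip

puts stripped.start_with?("\u00A0")  # prints true
puts normalized                      # prints Hello world
```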

Unicode character class for space separator

As an alternative to the POSIX bracket expression, you may also use a Unicode character class. The expression for the space separator category is \p{Zs}. This matches all kinds of spaces that take up space, but no tabs, line breaks or form feeds ([\n\r\t\f]) and no zero-width spaces:

"\xC2\xA0".match?(/\p{Zs}/)
# => true

"\u00A0".match?(/\p{Zs}/)
# => true

" ".match?(/\p{Zs}/)
# => true

"\n".match?(/\p{Zs}/)
# => false

nbsp = "\xC2\xA0".encode!(Encoding::ISO_8859_1)
# => "\xA0"
nbsp.match?(/\p{Zs}/) 
# => Encoding::CompatibilityError (incompatible encoding regexp match (UTF-8 regexp with ISO-8859-1 string))
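
To see how the three expressions differ, the following sketch runs a few sample characters through \p{Zs}, [[:space:]] and \s. The selection of characters and the names samples and comparison are assumptions made for this example.

```ruby
samples = {
  "space (U+0020)"             => " ",
  "no-break space (U+00A0)"    => "\u00A0",
  "em space (U+2003)"          => "\u2003",
  "ideographic space (U+3000)" => "\u3000",
  "tab (U+0009)"               => "\t",
  "zero-width space (U+200B)"  => "\u200B"
}

# For each character, record which of the three expressions match it.
comparison = samples.transform_values do |char|
  {
    zs:    char.match?(/\p{Zs}/),
    posix: char.match?(/[[:space:]]/),
    s:     char.match?(/\s/)
  }
end

comparison.each { |name, row| puts "#{name}: #{row}" }
```

In this run, \p{Zs} should match the four visible-width spaces but neither the tab nor the zero-width space, while \s should only match the plain space and the tab.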

Convert Encoding to UTF-8

If you want to use the Unicode or POSIX space separator character class but cannot be sure that a text is encoded in UTF-8, this card may help you.
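
As a minimal sketch of the idea (not necessarily the approach of the linked card), a string whose encoding is known and correctly tagged can be transcoded with String#encode before matching. The example assumes the input is ISO-8859-1; the variable names are illustrative.

```ruby
# A non-breaking space as a single ISO-8859-1 byte (0xA0).
latin1 = [0xA0].pack("C").force_encoding(Encoding::ISO_8859_1)

# Transcode to UTF-8; the byte 0xA0 becomes the two bytes 0xC2 0xA0.
utf8 = latin1.encode(Encoding::UTF_8)

puts utf8.bytes.inspect            # prints [194, 160]
puts utf8.match?(/\p{Zs}/)         # prints true
puts utf8.match?(/[[:space:]]/)    # prints true
```

This only works when the string's encoding tag is correct. A string with a wrong or unknown encoding needs to be handled separately.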

Last edit: Henning Koch
License: Source code in this card is licensed under the MIT License.
Posted by Bruno Sedler to makandra dev (2021-05-12 07:07)