Regular Expressions: Space Separators

Matching the "space" character class

To match whitespace in a regular expression, the most common and best-known shorthand is probably \s.
In Ruby, it matches the following whitespace characters:

" " (space)
\n (newline)
\r (carriage return)
\t (tab)
\f (form feed/page break)
\v (vertical tab)
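
The list above can be checked directly in Ruby. The following is only a minimal sketch; the names `candidates` and `results` are illustrative and not part of the original card. It also probes the vertical tab, which Ruby's Regexp documentation lists as part of \s.

```ruby
# Illustrative check: which ASCII whitespace characters does \s match?
# `candidates` and `results` are hypothetical names used only for this sketch.
candidates = {
  "space"           => " ",
  "newline"         => "\n",
  "carriage return" => "\r",
  "tab"             => "\t",
  "form feed"       => "\f",
  "vertical tab"    => "\v",
}

# Map every label to true/false depending on whether \s matches the character.
results = candidates.transform_values { |char| char.match?(/\s/) }

results.each { |name, matched| puts "#{name}: #{matched}" }
```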

However, in some cases this set may not be sufficient for your purpose.

Non-breaking spaces (nbsp)

Sometimes a text contains two words separated by a space, but the author wanted to ensure that those words stay on the same line. In such cases a non-breaking space character (U+00A0) is used. Its byte representation depends on the encoding. In a UTF-8 string, it can be written as \u00A0 or as the two bytes \xC2\xA0. In Ruby, \s only matches the ASCII whitespace characters listed above, so it does not match non-breaking spaces. However, there are two options you can use.
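
To illustrate the problem, here is a small sketch, assuming a UTF-8 source file (Ruby's default). The variable names and the sample sentence are made up for this example.

```ruby
# A non-breaking space, written as a Unicode escape.
nbsp = "\u00A0"

# In UTF-8 the character is encoded as the two bytes 0xC2 0xA0.
nbsp_bytes = nbsp.bytes           # => [194, 160]

# \s does not recognize it as whitespace ...
matched_by_s = nbsp.match?(/\s/)  # => false

# ... so splitting on \s+ keeps "10" and "km" glued together.
words = "10#{nbsp}km away".split(/\s+/)
# => ["10\u00A0km", "away"]
```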

POSIX character class for space separator (recommended)

If you know your text is valid UTF-8, you can use the POSIX character classes. The shorthand expression for the space separator character class is /[[:space:]]/. This is the recommended way, as it matches nbsps encoded in UTF-8 and also all whitespace characters matched by \s.

"\xC2\xA0".match?(/[[:space:]]/)
# => true

"\u00A0".match?(/[[:space:]]/)
# => true

" ".match?(/[[:space:]]/)
# => true

"\n".match?(/[[:space:]]/)
# => true

nbsp = "\xC2\xA0".encode!(Encoding::ISO_8859_1)
# => "\xA0"
nbsp.match?(/[[:space:]]/) 
# => Encoding::CompatibilityError (incompatible encoding regexp match (UTF-8 regexp with ISO-8859-1 string))

Unicode character class for space separator

As an alternative to the POSIX character class, you may also use a Unicode character class. The shorthand expression for the space separator character class is \p{Zs}. It matches all kinds of spaces that take up horizontal space, but it does not match tabs, line breaks or form feeds ([\n\r\t\f]), nor zero-width spaces:

"\xC2\xA0".match?(/\p{Zs}/)
# => true

"\u00A0".match?(/\p{Zs}/)
# => true

" ".match?(/\p{Zs}/)
# => true

"\n".match?(/\p{Zs}/)
# => false

nbsp = "\xC2\xA0".encode!(Encoding::ISO_8859_1)
# => "\xA0"
nbsp.match?(/\p{Zs}/) 
# => Encoding::CompatibilityError (incompatible encoding regexp match (UTF-8 regexp with ISO-8859-1 string))
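
The practical difference between the two classes shows up when cleaning text. The following sketch assumes UTF-8 input; `squish_spaces` and `collapse_horizontal_spaces` are hypothetical helper names introduced only for this illustration.

```ruby
# Hypothetical helper: collapse every run of whitespace (including nbsp,
# tabs and line breaks) into a single regular space and trim the ends.
def squish_spaces(text)
  text.gsub(/[[:space:]]+/, " ").strip
end

# Hypothetical helper: collapse only runs of space separators (Zs).
# Tabs and line breaks are left untouched.
def collapse_horizontal_spaces(text)
  text.gsub(/\p{Zs}+/, " ")
end

text = "10\u00A0km\t away\n next\u3000line"

squished  = squish_spaces(text)
# => "10 km away next line"

collapsed = collapse_horizontal_spaces(text)
# => "10 km\t away\n next line"
```

Use the POSIX class when any whitespace should count as a separator, and \p{Zs} when tabs and line breaks carry meaning that must be preserved.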

Convert Encoding to UTF-8

If you want to use the Unicode or POSIX space separator character class but cannot be sure that a text is encoded in UTF-8, this card may help you.
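
As a rough sketch of the idea, and not necessarily the approach the linked card describes: if you know the actual source encoding (ISO-8859-1 is assumed here), you can transcode the string with String#encode before matching.

```ruby
# A non-breaking space stored as a single ISO-8859-1 byte (0xA0).
latin1 = "\u00A0".encode(Encoding::ISO_8859_1)
latin1_bytes = latin1.bytes       # => [160]

# Transcode to UTF-8 first; the character becomes the bytes 0xC2 0xA0.
utf8 = latin1.encode(Encoding::UTF_8)
utf8_bytes = utf8.bytes           # => [194, 160]

# Both character classes now match.
zs_match    = utf8.match?(/\p{Zs}/)       # => true
posix_match = utf8.match?(/[[:space:]]/)  # => true
```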

Bruno Sedler

Last edit: 2023-11-13, Henning Koch

License

Source code in this card is licensed under the MIT License.

Posted by Bruno Sedler to makandra dev (2021-05-12 07:07)