Matching the "space" character class
To match whitespace in a regular expression, the most common and best-known shorthand is probably \s.
It matches the following whitespace characters:
- " " (space)
- \n (newline)
- \r (carriage return)
- \t (tab)
- \f (form feed/page break)
- \v (vertical tab)
However, in some cases these may not be good enough for your purpose.
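As a quick check of the list above, the following sketch tests each listed character against \s and adds a plain letter for contrast. The constant SAMPLES and the helper matched_by_backslash_s are names made up for this illustration.

```ruby
# Characters from the list above, plus a plain letter for contrast.
SAMPLES = {
  "space"           => " ",
  "newline"         => "\n",
  "carriage return" => "\r",
  "tab"             => "\t",
  "form feed"       => "\f",
  "letter"          => "a"
}

# Hypothetical helper: returns the names of all samples matched by \s.
def matched_by_backslash_s(samples)
  samples.select { |_name, char| char.match?(/\s/) }.keys
end

p matched_by_backslash_s(SAMPLES)
# => ["space", "newline", "carriage return", "tab", "form feed"]
```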
Non-breaking spaces (nbsp)
Sometimes a text contains two words separated by a space, but the author wanted to make sure both words stay on the same line. In such cases a non-breaking space character is used. How a non-breaking space is represented depends on the encoding of the text. In a UTF-8 string it is the code point U+00A0, which you can write as \u00A0 or as its two bytes \xC2\xA0. Because \s only matches the ASCII whitespace characters listed above, it does not match non-breaking spaces. However, there are two options you can use.
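To make this concrete, here is a small sketch, assuming a UTF-8 source file. It shows that both notations produce the same string and that \s skips over it. The constant NBSP and the helper split_on_backslash_s are hypothetical names used only here.

```ruby
NBSP = "\u00A0"

# Hypothetical helper: splits a text the way many scripts do, on \s.
def split_on_backslash_s(text)
  text.split(/\s+/)
end

# Both notations describe the same UTF-8 string: one code point, two bytes.
p NBSP == "\xC2\xA0"  # => true
p NBSP.bytes          # => [194, 160]

# \s does not see the non-breaking space ...
p "10\u00A0km".match?(/\s/)  # => false

# ... so "10" and "km" stay glued together when splitting on \s.
p split_on_backslash_s("run 10\u00A0km").length  # => 2
```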
POSIX character class for space separator (recommended)
If you know your text is valid UTF-8, you can use a POSIX character class. The expression for the whitespace class is /[[:space:]]/. This is the recommended way, as it matches non-breaking spaces encoded in UTF-8 as well as all whitespace characters matched by \s.
"\xC2\xA0".match?(/[[:space:]]/)
# => true
"\u00A0".match?(/[[:space:]]/)
# => true
" ".match?(/[[:space:]]/)
# => true
"\n".match?(/[[:space:]]/)
# => true
nbsp = "\xC2\xA0".encode!(Encoding::ISO_8859_1)
# => "\xA0"
nbsp.match?(/[[:space:]]/)
# => Encoding::CompatibilityError (incompatible encoding regexp match (UTF-8 regexp with ISO-8859-1 string))
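A typical use of this class is normalizing user input. The sketch below assumes the text is valid UTF-8 and uses a hypothetical helper, squish_unicode, that collapses every run of whitespace (non-breaking spaces included) into a single ordinary space and trims both ends.

```ruby
# Hypothetical helper: collapse runs of whitespace, including nbsp,
# into one regular space and strip leading/trailing whitespace.
# Assumes the text is valid UTF-8.
def squish_unicode(text)
  text.gsub(/[[:space:]]+/, " ").strip
end

p squish_unicode("  price:\u00A010\u00A0\u00A0EUR \n")
# => "price: 10 EUR"
```

Replacing the runs first and stripping afterwards keeps the helper simple: once every whitespace character has been turned into a regular space, the plain strip is sufficient.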
Unicode character class for space separator
As an alternative to the POSIX character class, you may also use a Unicode character class. The shorthand expression for the space separator character class is \p{Zs}. It matches all kinds of spaces that have a width, but neither tabs and line breaks ([\n\r\t\f]) nor zero-width spaces:
"\xC2\xA0".match?(/\p{Zs}/)
# => true
"\u00A0".match?(/\p{Zs}/)
# => true
" ".match?(/\p{Zs}/)
# => true
"\n".match?(/\p{Zs}/)
# => false
nbsp = "\xC2\xA0".encode!(Encoding::ISO_8859_1)
# => "\xA0"
nbsp.match?(/\p{Zs}/)
# => Encoding::CompatibilityError (incompatible encoding regexp match (UTF-8 regexp with ISO-8859-1 string))
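Since \p{Zs} leaves tabs and line breaks alone, it can be handy when only the "horizontal" spaces should be normalized. The following is a minimal sketch under that assumption. The helper normalize_space_separators and the constant ZERO_WIDTH_SPACE are illustrative names, and the input is assumed to be valid UTF-8.

```ruby
ZERO_WIDTH_SPACE = "\u200B"

# Hypothetical helper: replace every space separator (regular space,
# nbsp, em space, ...) with a regular space. Tabs and line breaks are kept.
def normalize_space_separators(text)
  text.gsub(/\p{Zs}/, " ")
end

# U+00A0 (nbsp) and U+2003 (em space) are replaced, "\n" and "\t" survive.
p normalize_space_separators("a\u00A0b\n\tc\u2003d")
# => "a b\n\tc d"

# Zero-width spaces and tabs are not space separators.
p ZERO_WIDTH_SPACE.match?(/\p{Zs}/)  # => false
p "\t".match?(/\p{Zs}/)              # => false
```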
Convert Encoding to UTF-8
If you want to use the Unicode or POSIX space separator character class but cannot be sure that a text is encoded in UTF-8, this card may help you.
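As a rough illustration of the idea (not necessarily the approach the referenced card takes), a string whose declared encoding is correct can be transcoded with String#encode before matching. The helper to_utf8 is a hypothetical name, and the sketch assumes the ISO-8859-1 label on the input is accurate.

```ruby
# Hypothetical helper: return a UTF-8 copy of the text, transcoded from
# whatever encoding the string is currently labeled with.
def to_utf8(text)
  text.encode(Encoding::UTF_8)
end

# Build a non-breaking space as a single ISO-8859-1 byte (0xA0).
latin1_nbsp = [0xA0].pack("C").force_encoding(Encoding::ISO_8859_1)

utf8_nbsp = to_utf8(latin1_nbsp)
p utf8_nbsp.bytes              # => [194, 160]
p utf8_nbsp.match?(/\p{Zs}/)   # => true
```

Note that encode trusts the encoding label of the string. If a text is labeled incorrectly, the label has to be fixed first (for example with force_encoding), otherwise the transcoded result will be wrong.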