Regular Expressions: Space Separators
Matching the "space" character class
To match whitespace in a regular expression, the most common and best-known shorthand is probably \s.
It matches the following whitespace characters:
- " " (space)
- \n (newline)
- \r (carriage return)
- \t (tab)
- \f (form feed/page break)
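As a quick check, the following sketch runs each of the characters listed above against \s in Ruby. The SAMPLES hash and its labels are made up for illustration.

```ruby
# Hypothetical sample table: label => character from the list above.
SAMPLES = {
  "space"           => " ",
  "newline"         => "\n",
  "carriage return" => "\r",
  "tab"             => "\t",
  "form feed"       => "\f"
}

# For each sample, record whether the whole string is matched by \s.
RESULTS = SAMPLES.transform_values { |char| char.match?(/\A\s\z/) }

# Print one line per character, e.g. "tab: true".
RESULTS.each { |label, matched| puts "#{label}: #{matched}" }
```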
However, in some cases these may not be good enough for your purpose.
Non-breaking spaces (nbsp)
Sometimes a text contains two words separated by a space, but the author wanted to ensure that both words stay on the same line. In such cases a non-breaking space character (U+00A0) is used. How a non-breaking space is represented in a text depends on the encoding. In UTF-8 it is the two bytes 0xC2 0xA0, which you can write in a Ruby string as "\u00A0" or "\xC2\xA0". In Ruby, \s only matches the ASCII whitespace characters listed above, so it does not match non-breaking spaces. However, there are two options you can use.
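A minimal sketch of this behaviour, assuming the source file is UTF-8 encoded (the Ruby default). The constant names are illustrative only.

```ruby
# Both literals produce the same UTF-8 string: the code point U+00A0,
# stored as the two bytes 0xC2 0xA0.
NBSP_BY_CODEPOINT = "\u00A0"
NBSP_BY_BYTES     = "\xC2\xA0"

SAME_STRING = NBSP_BY_CODEPOINT == NBSP_BY_BYTES  # true
NBSP_BYTES  = NBSP_BY_CODEPOINT.bytes             # [194, 160]

# \s matches the regular space but not the non-breaking space.
NBSP_MATCHES_S  = NBSP_BY_CODEPOINT.match?(/\s/)  # false
SPACE_MATCHES_S = " ".match?(/\s/)                # true
```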
POSIX character class for space separator (recommended)
If you know your text is valid UTF-8, you can use the POSIX character classes. The POSIX bracket expression for whitespace is [[:space:]], written in a regular expression as /[[:space:]]/. This is the recommended way, as it matches nbsps encoded in UTF-8 and also all whitespace characters matched by \s.
    "\xC2\xA0".match?(/[[:space:]]/) # => true
    "\u00A0".match?(/[[:space:]]/)   # => true
    " ".match?(/[[:space:]]/)        # => true
    "\n".match?(/[[:space:]]/)       # => true

    nbsp = "\xC2\xA0".encode!(Encoding::ISO_8859_1) # => "\xA0"
    nbsp.match?(/[[:space:]]/)
    # => Encoding::CompatibilityError (incompatible encoding regexp match (UTF-8 regexp with ISO-8859-1 string))
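One place where this matters in practice is splitting or trimming text that may contain nbsps. The sketch below is illustrative: the sample string is made up, and String#strip is shown only for contrast.

```ruby
# Hypothetical input: an nbsp between "10" and "kg", and nbsps at both ends.
TEXT = "\u00A0weight: 10\u00A0kg\u00A0"

# String#strip leaves the non-breaking spaces in place ...
STRIPPED = TEXT.strip

# ... while a [[:space:]]-based trim removes them.
TRIMMED = TEXT.gsub(/\A[[:space:]]+|[[:space:]]+\z/, "")

# Splitting on [[:space:]]+ treats regular spaces and nbsps alike.
WORDS = TRIMMED.split(/[[:space:]]+/)
```

Here STRIPPED is identical to TEXT, TRIMMED is "weight: 10\u00A0kg", and WORDS is ["weight:", "10", "kg"].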
Unicode character class for space separator
The Unicode character classes also work only for UTF-8. The shorthand expression for the space separator character class is \p{Zs}. This matches nbsps encoded in UTF-8. But be aware that it does not match [\n\r\t\f].
    "\xC2\xA0".match?(/\p{Zs}/) # => true
    "\u00A0".match?(/\p{Zs}/)   # => true
    " ".match?(/\p{Zs}/)        # => true
    "\n".match?(/\p{Zs}/)       # => false

    nbsp = "\xC2\xA0".encode!(Encoding::ISO_8859_1) # => "\xA0"
    nbsp.match?(/\p{Zs}/)
    # => Encoding::CompatibilityError (incompatible encoding regexp match (UTF-8 regexp with ISO-8859-1 string))
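If you need both the Unicode space separators and the control characters matched by \s, one option is to combine them in a single bracket expression. This is a sketch rather than something the card prescribes, and the constant name ANY_SPACE is made up.

```ruby
# Hypothetical combined class: Unicode space separators (Zs) plus \s.
ANY_SPACE = /[\p{Zs}\s]/

NBSP_MATCH     = "\u00A0".match?(ANY_SPACE)  # nbsp, category Zs
EM_SPACE_MATCH = "\u2003".match?(ANY_SPACE)  # EM SPACE, also category Zs
NEWLINE_MATCH  = "\n".match?(ANY_SPACE)      # matched via \s
LETTER_MATCH   = "a".match?(ANY_SPACE)       # not whitespace
```

The first three checks return true and the last returns false. As with \p{Zs} on its own, this assumes the string is UTF-8.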
Convert Encoding to UTF-8
If you want to use the Unicode or POSIX space separator character class but cannot be sure that a text is encoded in UTF-8, this card may help you.
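As a rough sketch of the idea, assuming you already know the text's actual source encoding (ISO-8859-1 here), String#encode transcodes it to UTF-8 so that the character classes above apply. The byte string is constructed by hand for illustration.

```ruby
# Build a one-byte string holding 0xA0 and tag it as ISO-8859-1,
# where that byte is the non-breaking space.
LATIN1_NBSP = [0xA0].pack("C").force_encoding(Encoding::ISO_8859_1)

# Transcode to UTF-8; the single byte 0xA0 becomes the two bytes 0xC2 0xA0.
UTF8_NBSP = LATIN1_NBSP.encode(Encoding::UTF_8)

UTF8_BYTES  = UTF8_NBSP.bytes                    # [194, 160]
ZS_MATCH    = UTF8_NBSP.match?(/\p{Zs}/)         # true
POSIX_MATCH = UTF8_NBSP.match?(/[[:space:]]/)    # true
```

If the source encoding is unknown, it has to be determined first; String#encode only converts from the encoding the string is tagged with.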