How to prevent Nokogiri from fixing invalid HTML

Arne Hartherz, Software engineer at makandra GmbH
July 10, 2020

Nokogiri is great. It will even fix invalid HTML for you, like a browser would (e.g. move block elements out of parents which are specified to not allow them).

>> Nokogiri::HTML.fragment("<h1><p>foo</p><span>bar</span></h1>").to_s
=> "<h1></h1><p>foo</p><span>bar</span>"

While this is mostly useful, browsers are actually fine with a bit of badly formatted HTML. And you don't want to be the one to blame when the SEO folks complain about an empty <h1>.

To avoid said behavior, use Nokogiri::XML instead of Nokogiri::HTML when parsing your HTML string. As long as tags are closing in the correct order, you should be fine.

However, when converting back to an HTML string, you may end up with extra linebreaks that were not there before.

>> Nokogiri::XML.fragment("<h1><p>foo</p><span>bar</span></h1>").to_s
=> "<h1>\n  <p>foo</p>\n  <span>bar</span>\n</h1>"

In HTML, line breaks between elements are meaningful (they count as whitespace, which may be rendered between inline elements), so you probably don't want Nokogiri to add them.

Instead of to_s, use to_xml and pass the AS_HTML flag from Nokogiri's SaveOptions.

>> Nokogiri::XML.fragment("<h1><p>foo</p><span>bar</span></h1>").to_xml(save_with: Nokogiri::XML::Node::SaveOptions::AS_HTML)
=> "<h1><p>foo</p><span>bar</span></h1>"

If the input HTML contains a line break (somewhere inside the document, not at the beginning or end), to_s no longer adds indentation and returns the input unchanged, just like to_xml with AS_HTML does.

>> Nokogiri::XML.fragment("<h1><p>foo</p>\n<span>bar</span></h1>").to_s
=> "<h1><p>foo</p>\n<span>bar</span></h1>"

While using to_s would be fine in such cases, I suggest you keep using the more explicit to_xml(save_with: ...AS_HTML).

Final note: Nokogiri's XML nodes (the Ruby objects) behave a bit differently than HTML nodes. For example, use name instead of tag when reading or modifying the document.

Posted by Arne Hartherz to makandra dev (2020-07-10 22:08)