How to prevent Nokogiri from fixing invalid HTML

Arne Hartherz
July 10, 2020
Software engineer at makandra GmbH

Nokogiri is great. It will even fix invalid HTML for you, like a browser would (e.g. move block elements out of parents which are specified to not allow them).

>> Nokogiri::HTML.fragment("<h1><p>foo</p><span>bar</span></h1>").to_s
=> "<h1></h1><p>foo</p><span>bar</span>"

While this is mostly useful, browsers are actually fine with a bit of badly formatted HTML. And you don't want to be the one to blame when the SEO folks complain about an empty <h1>.

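To avoid this behavior, use Nokogiri::XML instead of Nokogiri::HTML when parsing your HTML string. The XML parser does not apply HTML's nesting rules, so it leaves the structure alone. As long as every tag is closed and tags close in the correct order, you should be fine.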

However, when converting back to an HTML string, you may end up with extra linebreaks that were not there before.

>> Nokogiri::XML.fragment("<h1><p>foo</p><span>bar</span></h1>").to_s
=> "<h1>\n  <p>foo</p>\n  <span>bar</span>\n</h1>"

In HTML, line breaks between elements are meaningful: they count as whitespace, which can show up as a space between inline elements. You would probably prefer not to have them added.

Instead of to_s, use to_xml and pass the AS_HTML flag from SaveOptions.

>> Nokogiri::XML.fragment("<h1><p>foo</p><span>bar</span></h1>").to_xml(save_with: Nokogiri::XML::Node::SaveOptions::AS_HTML)
=> "<h1><p>foo</p><span>bar</span></h1>"

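If the input HTML already contains a line break (somewhere inside the document, not at the beginning or end), to_s no longer adds indentation and returns the markup with its original whitespace, just like the to_xml call above. This is presumably because the underlying serializer skips pretty-printing for elements that already contain text nodes.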

>> Nokogiri::XML.fragment("<h1><p>foo</p>\n<span>bar</span></h1>").to_s
=> "<h1><p>foo</p>\n<span>bar</span></h1>"

While using to_s would be fine in such cases, I suggest you keep using the more explicit to_xml(save_with: ...AS_HTML).

Final note: Nokogiri's XML nodes (the Ruby objects) behave a bit differently than HTML nodes. For example, use name instead of tag when reading or modifying the document.

Posted by Arne Hartherz to makandra dev (2020-07-10 22:08)