How to prevent Nokogiri from fixing invalid HTML

Nokogiri is great. It will even fix invalid HTML for you, like a browser would (e.g. move block elements out of parents which are specified to not allow them).

>> Nokogiri::HTML.fragment("<h1><p>foo</p><span>bar</span></h1>").to_s
=> "<h1></h1><p>foo</p><span>bar</span>"

While this is mostly useful, browsers are actually fine with a bit of badly formatted HTML. And you don't want to be the one to blame when the SEO folks complain about an empty <h1>.

To avoid this behavior, use Nokogiri::XML instead of Nokogiri::HTML when parsing your HTML string. As long as tags are closed in the correct order, you should be fine.

However, when converting back to an HTML string, you may end up with extra linebreaks that were not there before.

>> Nokogiri::XML.fragment("<h1><p>foo</p><span>bar</span></h1>").to_s
=> "<h1>\n  <p>foo</p>\n  <span>bar</span>\n</h1>"

In HTML, line breaks between elements are meaningful (they are whitespace which may appear between inline elements), so you would probably prefer not having them.

Instead of to_s, use to_xml and pass some SaveOptions.

>> Nokogiri::XML.fragment("<h1><p>foo</p><span>bar</span></h1>").to_xml(save_with: Nokogiri::XML::Node::SaveOptions::AS_HTML)
=> "<h1><p>foo</p><span>bar</span></h1>"

If the input HTML contains a line break (somewhere inside the document, not at the beginning or end), to_s stops adding its own indentation and returns the markup as it was given, just like the to_xml call above.

>> Nokogiri::XML.fragment("<h1><p>foo</p>\n<span>bar</span></h1>").to_s
=> "<h1><p>foo</p>\n<span>bar</span></h1>"

While using to_s would be fine in such cases, I suggest you keep using the more explicit to_xml(save_with: ...AS_HTML).

Final note: Nokogiri's XML nodes (the Ruby objects) behave a bit differently than HTML nodes. For example, use name instead of tag when reading or modifying the document.

Arne Hartherz Over 3 years ago