How to prevent Nokogiri from fixing invalid HTML

Nokogiri is great. It will even fix invalid HTML for you, like a browser would (e.g. by moving block elements out of parents that are specified not to allow them).

>> Nokogiri::HTML.fragment("<h1><p>foo</p><span>bar</span></h1>").to_s
=> "<h1></h1><p>foo</p><span>bar</span>"

While this is mostly useful, browsers are actually fine with a bit of badly formatted HTML. And you don't want to be the one to blame when the SEO guy complains about an empty <h1>.

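To avoid this behavior, use Nokogiri::XML instead of Nokogiri::HTML when parsing your HTML string. The XML parser knows nothing about which HTML elements may contain which, so it leaves the nesting alone. As long as tags are closed in the correct order, you should be fine.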

However, when converting back to an HTML string, you may end up with extra line breaks that were not there before.

>> Nokogiri::XML.fragment("<h1><p>foo</p><span>bar</span></h1>").to_s
=> "<h1>\n  <p>foo</p>\n  <span>bar</span>\n</h1>"

In HTML, line breaks between elements are meaningful (they count as whitespace, which may show up between inline elements), so you would probably prefer not to have them. Instead of to_s, use to_xml and pass SaveOptions via save_with.

>> Nokogiri::XML.fragment("<h1><p>foo</p><span>bar</span></h1>").to_xml(save_with: Nokogiri::XML::Node::SaveOptions::AS_HTML)
=> "<h1><p>foo</p><span>bar</span></h1>"

Note that to_s already produces the output you want when the input HTML contains a line break (though not when the line break is only at the beginning or end). In such cases, you can use to_s just fine.

>> Nokogiri::XML.fragment("<h1><p>foo</p>\n<span>bar</span></h1>").to_s
=> "<h1><p>foo</p>\n<span>bar</span></h1>"

Final note: XML nodes behave a bit differently than HTML nodes. For example, use name instead of tag when reading or modifying the document.

Arne Hartherz 5 months ago