Nokogiri is great. It will even fix invalid HTML for you, like a browser would (e.g. move block elements out of parents which are specified to not allow them).
>> Nokogiri::HTML.fragment("<h1><p>foo</p><span>bar</span></h1>").to_s
=> "<h1></h1><p>foo</p><span>bar</span>"
While this is mostly useful, browsers are actually fine with a bit of badly formatted HTML. And you don't want to be the one to blame when the SEO folks complain about an empty <h1>.
To avoid this behavior, use Nokogiri::XML instead of Nokogiri::HTML when parsing your HTML string. As long as tags are closed in the correct order, you should be fine.
However, when converting back to an HTML string, you may end up with extra linebreaks that were not there before.
>> Nokogiri::XML.fragment("<h1><p>foo</p><span>bar</span></h1>").to_s
=> "<h1>\n <p>foo</p>\n <span>bar</span>\n</h1>"
In HTML, line breaks between elements are meaningful (they are whitespace which may appear between inline elements), so you would probably prefer not having them.
Instead of to_s, use to_xml and pass some SaveOptions.
>> Nokogiri::XML.fragment("<h1><p>foo</p><span>bar</span></h1>").to_xml(save_with: Nokogiri::XML::Node::SaveOptions::AS_HTML)
=> "<h1><p>foo</p><span>bar</span></h1>"
If the input HTML contains a line break (somewhere inside the document, not at the beginning or end), to_s stops adding its own line breaks and indentation, so its output matches the to_xml call above.
>> Nokogiri::XML.fragment("<h1><p>foo</p>\n<span>bar</span></h1>").to_s
=> "<h1><p>foo</p>\n<span>bar</span></h1>"
While using to_s would be fine in such cases, I suggest you keep using the more explicit to_xml(save_with: ...AS_HTML).
Final note: Nokogiri's XML nodes (the Ruby objects) behave a bit differently than HTML nodes. For example, use name instead of tag when reading or modifying the document.