Nokogiri: How to parse large XML files with a SAX parser

Florian Leinsinger, software engineer at makandra GmbH
December 02, 2021

In my case [...] the catalog is an XML file that contains all kinds of products, categories and vendors, and it is updated once a month. When you read this file with Nokogiri's default (DOM) parser, it creates a tree structure with all branches and leaves, which lets you easily navigate through it via CSS/XPath selectors.
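For illustration, a minimal sketch of the DOM approach (the file name `catalog.xml` and the element names are made up for this example):

    require 'nokogiri'

    # Reads the entire catalog into memory and builds a DOM tree.
    doc = Nokogiri::XML(File.read('catalog.xml'))

    # Easy navigation via CSS or XPath selectors, e.g. all product names.
    doc.css('product > name').each do |name|
      puts name.text
    end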

The only problem is that reading the whole file into memory takes a significant amount of RAM. It is really inefficient to pay for a server with that much RAM when you only need it once a month. Since I don't need to navigate through the tree structure, but just replicate all the needed data into the database, the best option is to use a SAX parser.

When you read very large XML files, Nokogiri may explode with this message while creating the tree structure of your file:

Nokogiri::XML::XPath::SyntaxError: FATAL: Memory allocation failed: growing nodeset hit limit

A SAX parser could be a way to solve this problem.
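Here is a minimal sketch of what that could look like. The handler class and the element names (`product`, `name`) are made up for this example; the callbacks (`start_element`, `characters`, `end_element`) are the ones Nokogiri's SAX API provides:

    require 'nokogiri'

    # A SAX handler only sees one event (element start, text, element end)
    # at a time, so memory usage stays roughly constant regardless of file size.
    class CatalogHandler < Nokogiri::XML::SAX::Document
      def start_element(name, attrs = [])
        @current_element = name
        @buffer = +'' if name == 'name'
      end

      def characters(string)
        @buffer << string if @current_element == 'name'
      end

      def end_element(name)
        if name == 'name'
          # Here you could persist the collected data instead of printing it.
          puts @buffer
        end
        @current_element = nil
      end
    end

    parser = Nokogiri::XML::SAX::Parser.new(CatalogHandler.new)
    parser.parse(File.open('catalog.xml'))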
