Read more

Import Excel files without running into memory limitations

Michael Leimstädtner
August 30, 2017Software engineer at makandra GmbH

There are several gems that make it easy to read and process xlsx files. Parsing the entire file at once however is error-prone since each cell is transformed to a ruby object - sometimes including thousands of formatted but empty cells.

Illustration UI/UX Design

UI/UX Design by makandra brand

We make sure that your target audience has the best possible experience with your digital product. You get:

  • Design tailored to your audience
  • Proven processes customized to your needs
  • An expert team of experienced designers
Read more Show archive.org snapshot

As of today, I found two promising alternatives that provide a stream-based access to spradsheet rows:

  • Roo Show archive.org snapshot supports multiple spreadsheet types like ODS or CSV and has a quite large contributor base
  • Creek Show archive.org snapshot uses a similar approach and worked fine for me as well. It does not support XLS or CSV files but is the clear choice if you have to process metadata like hidden rows. All numbers will be parsed as floats though.

In my case, I had to switch from SimpleXlsxReader Show archive.org snapshot to Creek since a badly formatted 12MB Sheet broke the 500MB Staging Memory limit when reading it as a whole.

The implementation follows the Creek's documentation Show archive.org snapshot :

document = Creek::Book.new('/path/to/file.xlsx')
document.sheets.each do |sheet|
  sheet.rows.each do |row|
    cells = row.values
    # Process this row
  end
end

You should note two things:

  • The row hash maps cell names to their values
  • sheet.rows is an Enumerator and yields each row subsequently
Posted by Michael Leimstädtner to makandra dev (2017-08-30 13:44)