Import Excel files without running into memory limitations

There are several gems that make it easy to read and process xlsx files. Parsing the entire file at once however is error-prone since each cell is transformed to a ruby object - sometimes including thousands of formatted but empty cells.

As of today, I found two promising alternatives that provide a stream-based access to spradsheet rows:

  • Roo Show archive.org snapshot supports multiple spreadsheet types like ODS or CSV and has a quite large contributor base
  • Creek Show archive.org snapshot uses a similar approach and worked fine for me as well. It does not support XLS or CSV files but is the clear choice if you have to process metadata like hidden rows. All numbers will be parsed as floats though.

In my case, I had to switch from SimpleXlsxReader Show archive.org snapshot to Creek since a badly formatted 12MB Sheet broke the 500MB Staging Memory limit when reading it as a whole.

The implementation follows the Creek's documentation Show archive.org snapshot :

document = Creek::Book.new('/path/to/file.xlsx')
document.sheets.each do |sheet|
  sheet.rows.each do |row|
    cells = row.values
    # Process this row
  end
end

You should note two things:

  • The row hash maps cell names to their values
  • sheet.rows is an Enumerator and yields each row subsequently
Michael Leimstädtner Over 6 years ago