Posted almost 6 years ago. Visible to the public.

How to load only a subset of a massive MySQL dump

I had a huge MySQL dump that took forever (as in: days) to import, while I actually just wanted to have the full database structure with some data to use on my development machine.

After trying several suggestions on how to speed up slow MySQL dump imports (which did not result in any significant improvement), I chose to import just some rows per table to suffice my needs. Since editing the file was not an option, I used a short Ruby script to manage that.

Here is how:

Copy
pv huge.dump | ruby -e 'ARGF.each_line { |l| m = l.match(/^INSERT INTO \`.+\` .+ VALUES \((\d+),/); puts l if !m || m[1].to_i < 200_000 || l =~ /schema_migrations/ }' | mysql -uUSER -pSECRET DATABASE_NAME

The command above does the following:

  1. It sends huge.dump to stdout. You could do that with cat, but I chose to use pv for a nice progress bar.
  2. The output is piped into a Ruby script which:
    1. Checks, line by line, if the current line is an INSERT statement.
    2. If it is not, it's printed (to stdout)
    3. If it is, we only print it when it inserts an ID smaller than 200'000 (IDs are always the first column, so we can check against "VALUES (\d+,").
    4. If it includes any mention of schema_migrations we also print it (because we want them all).
  3. The result of the Ruby script is piped into the mysql client. Replace USER and SECRET with your database credentials, and DATABASE_NAME with the database you are going to import into.

Note the following:

  • This is far from perfect.
  • I assume a significant amount of time is spent in Ruby. This could probably be improved by using tools such as sed and awk, but I did not want to go down that road.
  • Because of encoding issues this can break the imported data in many ways.
  • It also would not work as expected for tables who are missing an id column (which you shouldn't do), or where that column is not the first.
  • Dumps usually insert records in batches, so we check only for the first ID that is inserted per batch.
  • Records in the database may very well not match up and be considered invalid, just because associated records (with higher IDs) might be missing, or similar.

Does your version of Ruby on Rails still receive security updates?
Rails LTS provides security patches for old versions of Ruby on Rails (3.2 and 2.3).

Owner of this card:

Avatar
Arne Hartherz
Last edit:
almost 6 years ago
About this deck:
We are makandra and do test-driven, agile Ruby on Rails software development.
License for source code
Posted by Arne Hartherz to makandra dev
This website uses cookies to improve usability and analyze traffic.
Accept or learn more