How to load only a subset of a massive MySQL dump


I had a huge MySQL dump that took forever (as in: days) to import, when all I actually wanted was the full database structure with some data to use on my development machine.

After trying several suggestions on how to speed up slow MySQL dump imports (none of which resulted in any significant improvement), I chose to import just some rows per table, which was enough for my needs. Since editing the file by hand was not an option, I used a short Ruby script to manage that.

Here is how:

pv huge.dump | ruby -e 'ARGF.each_line { |l| m = l.match(/^INSERT INTO \`.+\` .+ VALUES \((\d+),/); puts l if !m || m[1].to_i < 200_000 || l =~ /schema_migrations/ }' | mysql -uUSER -pSECRET DATABASE_NAME

The command above does the following:

  1. It sends huge.dump to stdout. You could do that with cat, but I chose to use pv for a nice progress bar.
  2. The output is piped into a Ruby script (a commented version of it follows below) which:
    1. Checks, line by line, whether the current line is an INSERT statement.
    2. If it is not, it is printed to stdout.
    3. If it is, it is only printed when it inserts an ID smaller than 200,000 (IDs are always the first column, so we can check against VALUES \((\d+),).
    4. If it mentions schema_migrations, it is also printed (because we want them all).
  3. The result of the Ruby script is piped into the mysql client. Replace USER and SECRET with your database credentials, and DATABASE_NAME with the database you are going to import into.
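
For reference, here is the same filter as a standalone, commented Ruby script. It behaves exactly like the one-liner above; the file name filter_dump.rb is just an example:

# filter_dump.rb -- same behavior as the one-liner above.
MAX_ID = 200_000 # keep rows whose first inserted ID is below this threshold

ARGF.each_line do |line|
  # Is this an INSERT statement, and which ID does it insert first?
  match = line.match(/^INSERT INTO `.+` .+ VALUES \((\d+),/)

  keep = match.nil? ||                 # not an INSERT: structure, comments, ...
         match[1].to_i < MAX_ID ||     # first inserted ID is small enough
         line =~ /schema_migrations/   # we want all of those
  puts line if keep
end

Use it just like the one-liner: pv huge.dump | ruby filter_dump.rb | mysql -uUSER -pSECRET DATABASE_NAME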

Note the following:

  • This is far from perfect.
  • I assume a significant amount of time is spent in Ruby. This could probably be improved by using tools such as sed and awk, but I did not want to go down that road.
  • Because of encoding issues, this can break the imported data in many ways.
  • It also does not work as expected for tables that are missing an id column (which you shouldn't have anyway), or where that column is not the first.
  • Dumps usually insert records in batches, so we only check the first ID that is inserted per batch (see the sketch after this list for a per-row variant).
  • Records may very well not match up and be considered invalid, simply because associated records (with higher IDs) are missing from the import.
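
To work around the batch limitation, one could filter individual rows inside each batched INSERT instead of keeping or dropping whole statements. The following is only a rough sketch of that idea (it is not what the command above does): it naively splits the VALUES list on the character sequence "),(", so it will corrupt data whenever a string value happens to contain that sequence.

# filter_rows.rb -- naive per-row filter for batched (extended) INSERTs.
# Assumption: no string value contains "),(". A robust version would
# have to parse quoting and escaping, which this sketch skips.
MAX_ID = 200_000

ARGF.each_line do |line|
  match = line.match(/\A(INSERT INTO `.+?` .*?VALUES )\((.*)\);\s*\z/)

  # Pass through anything that is not a batched INSERT we can split,
  # and keep schema_migrations completely.
  if match.nil? || line =~ /schema_migrations/
    puts line
    next
  end

  rows = match[2].split('),(')
  # Rows without a leading numeric ID yield nil -> 0, so they are kept.
  kept = rows.select { |row| row[/\A(\d+),/, 1].to_i < MAX_ID }
  puts "#{match[1]}(#{kept.join('),(')});" unless kept.empty?
end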
License: Source code in this card is licensed under the MIT License.
Posted by Arne Hartherz to makandra dev (2014-01-15 10:44)