How to load only a subset of a massive MySQL dump
I had a huge MySQL dump that took forever (as in: days) to import, while I actually just wanted to have the full database structure with some data to use on my development machine.
After trying several suggestions on how to speed up slow MySQL dump imports (none of which resulted in any significant improvement), I chose to import only some rows per table, which satisfied my needs. Since editing the file was not an option, I used a short Ruby script to manage that.
Here is how:
pv huge.dump | ruby -e 'ARGF.each_line { |l| m = l.match(/^INSERT INTO \`.+\` .+ VALUES \((\d+),/); puts l if !m || m[1].to_i < 200_000 || l =~ /schema_migrations/ }' | mysql -uUSER -pSECRET DATABASE_NAME
The command above does the following:
- It sends huge.dump to stdout. You could do that with cat, but I chose to use pv for a nice progress bar.
- The output is piped into a Ruby script which:
  - Checks, line by line, if the current line is an INSERT statement.
  - If it is not, it is printed (to stdout).
  - If it is, we only print it when it inserts an ID smaller than 200,000 (IDs are always the first column, so we can check against "VALUES (\d+,").
  - If it includes any mention of schema_migrations, we also print it (because we want them all).
- The result of the Ruby script is piped into the mysql client. Replace USER and SECRET with your database credentials, and DATABASE_NAME with the database you are going to import into.
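For easier tweaking, the one-liner can be expanded into a standalone script. This is just a sketch with the same logic as above; the 200,000 cutoff and the regular expression are taken from the command, and the file name filter_dump.rb is a hypothetical choice:

```ruby
# filter_dump.rb -- same filtering logic as the one-liner above, as a script.
ID_LIMIT = 200_000

# Returns true if the dump line should be forwarded to mysql.
def keep_line?(line)
  match = line.match(/^INSERT INTO `.+` .+ VALUES \((\d+),/)
  return true unless match                    # not an INSERT: keep it (schema etc.)
  return true if line =~ /schema_migrations/  # keep all schema_migrations rows
  match[1].to_i < ID_LIMIT                    # otherwise keep only small IDs
end

# Read the dump from stdin (or a file argument) and print the kept lines.
if __FILE__ == $PROGRAM_NAME
  ARGF.each_line { |line| puts line if keep_line?(line) }
end
```

It would be used just like the one-liner: pv huge.dump | ruby filter_dump.rb | mysql -uUSER -pSECRET DATABASE_NAME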
Note the following:
- This is far from perfect.
- I assume a significant amount of time is spent in Ruby. This could probably be improved by using tools such as sed and awk, but I did not want to go down that road.
- Because of encoding issues, this can break the imported data in many ways.
- It would also not work as expected for tables that are missing an id column (which you shouldn't do), or where that column is not the first.
- Dumps usually insert records in batches, so we only check the first ID that is inserted per batch.
- Records in the database may well fail validation simply because associated records (with higher IDs) are missing, or similar.
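To illustrate the batching caveat: on a hypothetical multi-row INSERT from a dump, the regular expression only sees the first row's ID, so the whole batch passes the filter even when later rows exceed the limit:

```ruby
# A hypothetical batched INSERT from a dump: three rows in one statement.
line = "INSERT INTO `users` (`id`,`name`) VALUES (199999,'a'),(200001,'b'),(200002,'c');"

match = line.match(/^INSERT INTO `.+` .+ VALUES \((\d+),/)
match[1]  # "199999" -- only the first row's ID is examined
# Since 199999 < 200_000, the whole batch is kept, including IDs 200001 and 200002.
```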
Posted by Arne Hartherz to makandra dev (2014-01-15 11:44)