Best practices: Large data migrations from legacy systems
Migrating data from a legacy system into a new one can be a surprisingly large undertaking. We have done this a few times. While projects differ significantly, we do have a list of general suggestions.
Before you start, talk to someone who has done it before, and read the following hints:
Before any technical considerations, you need to understand the old system as well as possible. If feasible, do not look only at its API, database, or frontend, but have a user of the old system show you its backend and how it works.
Often data will not map to the new system in an obvious manner. Identify the difficult cases as early as possible, and come up with a strategy for each.
How best to access the old data differs significantly from project to project.
- If the old system has a good API that supplies all necessary information, use that.
- Otherwise, try to get access to the database. If it is structured in a sensible way, you might try to mount the old database using ActiveRecord models.
- If it does not map well to ActiveRecord, you could use Sequel instead of writing plain SQL.
- If all else fails, we have also used screen scraping in the past.
Do not underestimate the difficulties of actually getting the data to us, and possibly getting an unfamiliar database system running. If we can get access to an existing remote database (and the amount of data is not prohibitively large), consider using that. If we have to transfer large amounts of data, having the customer send us USB sticks is an option.
If the amount of data is large enough that a migration takes significant time, use Sidekiq. If not, it might still be useful for the improved fault tolerance.
In this case, your migration code will usually consist of:
- code that just discovers all records to migrate and enqueues all the jobs
- code that migrates a single record
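The split into "discover and enqueue" and "migrate one record" could be sketched like this. It is a minimal illustration in plain Ruby: the array `queue` and the `drain` method merely stand in for Sidekiq and its worker processes, and `LEGACY_CUSTOMERS`, `NEW_CUSTOMERS`, `MigrateCustomerJob` and `EnqueueCustomerMigration` are hypothetical names.

```ruby
# Hypothetical in-memory stand-ins for the old and the new system.
LEGACY_CUSTOMERS = {
  1 => { name: "Alice" },
  2 => { name: "Bob" },
  3 => { name: "Carol" }
}.freeze
NEW_CUSTOMERS = {}

# Migrates exactly one record. Re-running it for the same ID simply
# overwrites the previous result, so a failed job can be retried.
class MigrateCustomerJob
  def perform(legacy_id)
    source = LEGACY_CUSTOMERS.fetch(legacy_id)
    NEW_CUSTOMERS[legacy_id] = { name: source.fetch(:name), legacy_id: legacy_id }
  end
end

# Only discovers records and enqueues one job per record.
# It does no migration work itself.
class EnqueueCustomerMigration
  def initialize(queue)
    @queue = queue
  end

  def call
    LEGACY_CUSTOMERS.each_key do |legacy_id|
      @queue << [MigrateCustomerJob, legacy_id]
    end
  end
end

# Stand-in for the worker processes a real job queue would provide.
def drain(queue)
  until queue.empty?
    job_class, legacy_id = queue.shift
    job_class.new.perform(legacy_id)
  end
end

queue = []
EnqueueCustomerMigration.new(queue).call
drain(queue)
puts "Migrated #{NEW_CUSTOMERS.size} customers"
```

Because each job takes only an ID and looks up its own data, a single failing record can be retried or inspected without re-running the whole migration.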
The migration will typically run a few times on staging, and once on production. While it does not have to be blazingly fast, a migration that takes several days is not acceptable if it can be avoided at all.
A migration that slow makes finding and debugging issues tedious, because the feedback loop becomes very long.
Your code needs to check if the incoming data matches expectations. If it cannot handle some input, it needs to crash or at least clearly log the issue.
If we cannot deal with some outliers, it might be okay to import them manually, but we have to know about them.
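One way to make such expectations explicit is a validation step that raises on unexpected input, combined with a caller that logs the failure and collects the affected IDs for manual follow-up. This is only a sketch: the field names, `KNOWN_STATUSES`, `UnexpectedLegacyData`, `validate_legacy_customer!` and `migrate_all` are assumptions made for illustration.

```ruby
require "logger"

# Raised when a legacy row does not look the way the migration expects.
class UnexpectedLegacyData < StandardError; end

# Hypothetical: the only status values the migration knows how to map.
KNOWN_STATUSES = %w[active inactive].freeze

# Returns the row if it matches expectations, raises otherwise.
def validate_legacy_customer!(row)
  problems = []
  problems << "email is blank" if row[:email].to_s.strip.empty?
  unless KNOWN_STATUSES.include?(row[:status])
    problems << "unknown status #{row[:status].inspect}"
  end
  return row if problems.empty?

  raise UnexpectedLegacyData, "legacy customer #{row[:id]}: #{problems.join(', ')}"
end

# Validates every row, logs each problem and returns the IDs of all
# outliers, so that nobody has to guess which records were skipped.
def migrate_all(rows, logger:)
  outliers = []
  rows.each do |row|
    begin
      validate_legacy_customer!(row)
      # ... the actual import of the row would happen here ...
    rescue UnexpectedLegacyData => e
      logger.error(e.message)
      outliers << row[:id]
    end
  end
  outliers
end

rows = [
  { id: 1, email: "a@example.com", status: "active" },
  { id: 2, email: "", status: "archived" }
]
skipped = migrate_all(rows, logger: Logger.new($stderr))
puts "Needs manual import: #{skipped.inspect}"
```

Early in a project it may be preferable to drop the `rescue` and let the migration crash, so that every unexpected case is noticed immediately.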
Use dedicated logs. While running the migration we should:
- get some indication of its progress
- see if there are problems or if it works as expected
- have enough information to debug issues afterwards
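A small wrapper around Ruby's standard `Logger` can cover all three points. The sketch below is illustrative only: `MigrationLog` and its methods are hypothetical, the counters live in memory and therefore only work within a single process, and in a real project the logger would more likely write to a dedicated file than to `$stdout`.

```ruby
require "logger"

# Hypothetical helper around a dedicated migration logger.
class MigrationLog
  def initialize(io, total:)
    @logger = Logger.new(io)
    @logger.progname = "migration"
    @total = total
    @processed = 0
    @failed = 0
  end

  # Logs a migrated record together with the overall progress.
  def success(legacy_id)
    @processed += 1
    @logger.info("migrated legacy_id=#{legacy_id} (#{@processed}/#{@total})")
  end

  # Logs enough detail to find and debug the failing record later.
  def failure(legacy_id, error)
    @processed += 1
    @failed += 1
    @logger.error(
      "failed legacy_id=#{legacy_id}: #{error.class}: #{error.message} " \
      "(#{@processed}/#{@total})"
    )
  end

  def summary
    "#{@processed}/#{@total} processed, #{@failed} failed"
  end
end

log = MigrationLog.new($stdout, total: 2)
log.success(1)
begin
  Integer("not a number") # simulates a record that fails to migrate
rescue ArgumentError => e
  log.failure(2, e)
end
puts log.summary
```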
Add some info about the migration process to the database in the new system. For each migrated record, you want to track at least:
- that it was in fact migrated, and not generated by a user
- when it was migrated, and if it was changed by a user in the meantime
- to which record (ID or maybe URL) it maps in the old system
In some situations it might even be worth saving a copy of the actual source data, or at least of those parts that were used to extract attribute values.
Put this data into separate records associated to the migrated records, instead of adding it to the tables of the application's existing records. Don't hesitate to add associations across those records, if they are related in some way.
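The shape of such a tracking record could look roughly like the following. This is a plain Ruby sketch so that it runs on its own; in a Rails application it would more likely be a separate model and table associated with the migrated record. `MigrationRecord` and all of its attribute names are hypothetical.

```ruby
# Hypothetical tracking record, stored separately from the migrated record.
# Its mere existence tells us that the record was migrated and not
# created by a user.
MigrationRecord = Struct.new(
  :record_type,     # class name of the migrated record in the new system
  :record_id,       # ID of the migrated record in the new system
  :legacy_id,       # ID of the source record in the old system
  :legacy_url,      # where to look the source record up in the old system
  :migrated_at,     # when the migration wrote the record
  :source_snapshot, # optional copy of the raw source data
  keyword_init: true
) do
  # True if the migrated record was modified after it was imported.
  def changed_since_migration?(record_updated_at)
    record_updated_at > migrated_at
  end
end

migrated_at = Time.utc(2024, 1, 1, 12, 0, 0)
tracking = MigrationRecord.new(
  record_type: "Customer",
  record_id: 42,
  legacy_id: 1007,
  legacy_url: "https://legacy.example.com/customers/1007",
  migrated_at: migrated_at,
  source_snapshot: { "NAME" => "Alice" }
)

puts tracking.changed_since_migration?(migrated_at)        # untouched since import
puts tracking.changed_since_migration?(migrated_at + 3600) # edited an hour later
```

Keeping this in its own record means the application's tables stay free of migration-only columns, and the tracking data can be dropped once it is no longer needed.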
In development, Rails' sandbox mode might be useful.
The migration code should live in some subdirectory in the project's main repository. Normal coding guidelines apply.
Try not to make your migration code too clever. For example, it is okay if the migration is not fully automatic and takes a few manual (and documented) steps, instead of implementing some complicated Sidekiq batch logic etc.
You need to write tests, as you will usually have to refactor your code several times and need to know that you did not break anything. Prefer "big picture" integration-like tests (using RSpec) over unit tests.
As early as possible, set aside time to compare the results of your migration to the old system. If possible, have a user familiar with the old system (i.e. the customer) do this.
Do not look only at one or two examples, but at a wider range. To accomplish this, you may want to implement a simple function that maps a random record of the old application (e.g. its URL) to its new counterpart. Picking records at random avoids the bias that comes with hand-selecting records for review.
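Such a function might look like the sketch below. It assumes that a mapping from legacy IDs to new IDs is available, for example from the tracking data written during the migration. The base URLs and the name `random_review_pairs` are made up for illustration.

```ruby
# Hypothetical base URLs of the old and the new application.
LEGACY_BASE_URL = "https://legacy.example.com/customers/"
NEW_BASE_URL = "https://new.example.com/customers/"

# Picks `count` random entries from a hash of legacy ID => new ID and
# returns the matching pair of URLs for each of them.
# Passing a seeded Random makes a selection reproducible.
def random_review_pairs(legacy_to_new, count:, random: Random.new)
  legacy_to_new.to_a.sample(count, random: random).map do |legacy_id, new_id|
    {
      legacy_url: "#{LEGACY_BASE_URL}#{legacy_id}",
      new_url: "#{NEW_BASE_URL}#{new_id}"
    }
  end
end

mapping = { 1007 => 1, 1008 => 2, 1011 => 3, 1020 => 4 }
random_review_pairs(mapping, count: 2).each do |pair|
  puts "#{pair[:legacy_url]} -> #{pair[:new_url]}"
end
```

The reviewer can then open both URLs side by side and compare the records.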