How to Build a Data Transfer Tool that Maintains Data Consistency

Sean Dean

For whatever reason, one of the things I’ve ended up doing a lot during my time at SiteCrafting is building import/export tools. Off the top of my head I can count at least five major projects, and plenty of minor ones, that have required building a tool that takes data from one source, processes it and inserts it into a different system.

Initially building out these tools was daunting. My first project involved moving approximately 200,000 rows of data between two incompatible systems. But over time, I’ve created a process for building tools to move data around. This is a rough sketch of what that looks like.

Assess the Data you Have

The success or failure of an import/export tool is largely dependent on an honest assessment of the data that is being moved, where it’s coming from and where it’s going. Considerations in this stage include how much data needs to be moved and what (if any) return on investment there is in building out a tool. 

  • First, take a look at how much data there is to move. If it’s less than 20 or 30 rows, it could easily be faster to move the data manually than to build a tool to do it for you. But if you’re looking at hundreds or thousands of rows, building a tool will likely be the best route forward. Also consider whether your data is “clean” or not. Recently, I had to build a tool for a client that had a lot of out-of-date, useless data, as well as a good deal of data that was incomplete or wrong. Ideally, all the data would have been cleaned up before we ran the tool, but the client was more interested in moving the data and planned on cleaning it up on the receiving end of the process. If you have the ability to clean up your data prior to moving it, that will make the process smoother and more efficient.
  • Second, consider the return on investment (ROI) of the tool. As mentioned previously, there will be a good ROI if you can build a tool in less time than it would take to move the data manually. An additional consideration is whether the tool will ever be used again. If this is a process you will repeat over and over, then having a tool that can do in seconds what would take hours makes sense. On the other hand, if a tool already exists that does what you need, the cost of building another one may be too high.

What Does Success Look Like?

Once you have an honest assessment of your data, you need to decide what success looks like for this tool. Yes, ideally everything will come over with zero loss of data, but if you’re working with unclean data there’s a good chance that compromises will need to be made. Ask yourself what is an acceptable loss of data, and what data is absolutely necessary for the new iteration of your application to function properly.

Of course you’ll try to get as much data over cleanly as possible, but knowing what success looks like will prevent a lot of heartache when things don’t go as well as you hope. Bugs happen, and as human developers we will occasionally miss an error. Just the other day I noticed that I had somehow mapped a column called parent_id to another called type_id. Why did this happen, and how did I miss it before we pushed to production? Who knows. But because that column wasn’t on the list of critical data to be transferred, it didn’t break much: I fixed the bug, cleaned up the data it affected and moved on.

If I hadn’t had a clear understanding of what success looks like, this could have thrown me for a loop.

Prepare for Data Transfer

If you haven’t already done so, clean up as much data as you can. Remove any old, outdated or unneeded data; there’s no use bringing over data you don’t need. Also be sure the data is formatted correctly. For instance, on another tool I built, one of the tasks was to move hours of operation for individual locations. The new app needed exactly seven sets of hours of operation, but in the old app a location could have between one and seven sets. We had to go in manually and make sure each location had the right number of sets so that the new app wouldn’t crash when editing a location.
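That kind of normalization is also easy to script. Here’s a minimal sketch of padding each location’s hours out to exactly seven sets; the data shape (a list of strings per location) and the "Closed" default are invented for illustration, not the actual schema from that project.

```python
# Pad each location's hours of operation out to exactly seven sets,
# so the destination app always sees the count it expects.
DEFAULT_HOURS = "Closed"

def normalize_hours(hours_sets, required=7):
    """Pad or truncate a location's hours to exactly `required` sets."""
    padded = list(hours_sets)[:required]
    padded += [DEFAULT_HOURS] * (required - len(padded))
    return padded

locations = {
    "Downtown": ["9-5", "9-5", "9-5", "9-5", "9-5"],  # only five sets
    "Airport": ["6-10"] * 7,                          # already seven
}

cleaned = {name: normalize_hours(hours) for name, hours in locations.items()}
```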

The final data schema (structure) of the new app needs to be in place before you can map and transfer data, and it’s important that it be built out both in a test environment and in production. Testing is critical: if you don’t have a place to test, you are destined to fail, and the same is true if your test environment isn’t identical to production.

Build a Datamap

Your datamap is really the key to the entire process of moving data from one place to another. Recently I’ve taken to building my datamap as a JSON file since most major programming languages are able to read JSON out of the box, but YAML or XML could work as well if you’re more comfortable with those formats. Choose whatever works best for you; I generally prefer to build out large arrays or objects composed of key-value pairs that indicate the column the data is coming from and the column it’s going into.

The first step is to map out all the tables you’re pulling data from and make them your top-level objects. Then, within each table-object, list the columns you’re exporting, with each value being the column the data is going into. Since the schema of the new app is likely somewhat different from the one you’re pulling out of, some data will need to be processed and reformatted before it can be inserted. For these columns I make the value an object that holds the name of a callback function and the parameters necessary to process the data; the callback processes the value and returns it clean.
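A datamap in that shape might look like the following. This is a hypothetical example, loaded here with Python’s standard json module; all table, column and callback names are invented for illustration.

```python
import json

# Source tables are the top-level objects. A plain string value is a
# one-to-one column rename; an object names a callback plus its
# parameters for columns that need processing first.
DATAMAP = json.loads("""
{
  "users": {
    "user_name": "display_name",
    "email": "email",
    "created": {"to": "created_at", "callback": "to_iso_date",
                "params": {"format": "%m/%d/%Y"}}
  },
  "locations": {
    "loc_title": "name",
    "hours": {"to": "hours", "callback": "pad_hours",
              "params": {"required": 7}}
  }
}
""")
```

Because the file is plain JSON, the same map can be read by whatever language ends up running the import.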

Something to keep in mind is that if you are importing data into an already populated database you’ll need to develop a system for mapping old IDs to new IDs since you probably won’t be able to maintain ID consistency between applications.
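One way to handle that bookkeeping is to record, as each row is inserted, the ID the new system assigned it, then rewrite foreign keys on dependent rows through that map. A sketch, where insert_row() is a stand-in for whatever your destination’s actual insert API is:

```python
# Map (table, old_id) -> new_id as rows are inserted into the new system.
new_ids = {}

_counter = 0
def insert_row(table, row):
    """Stand-in for the destination insert; pretend it assigns sequential IDs."""
    global _counter
    _counter += 1
    return _counter

for row in [{"id": 17, "name": "News"}, {"id": 42, "name": "Events"}]:
    new_ids[("categories", row["id"])] = insert_row("categories", row)

# Later, a post that referenced old category 42 gets the category's new ID:
old_post = {"id": 9, "category_id": 42}
old_post["category_id"] = new_ids[("categories", old_post["category_id"])]
```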

Code, Test, Repeat

Most of my importers now consist of three parts: a script that calls a controller; the controller, which loops through the datamap and holds the individual callback functions needed to process data manipulations; and, of course, the datamap itself. One choice you’ll need to make is how the script will be run: via the command line, a graphical interface, or an automated process like cron? From there, you can decide how to build out your processing system.
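The controller’s core loop can be sketched in a few lines. This is a minimal illustration of the three-part structure, not anyone’s production code; the datamap, callback names and sample row are all made up.

```python
# A tiny datamap: plain strings are renames, objects route the value
# through a named callback before it lands in the destination column.
DATAMAP = {
    "users": {
        "user_name": "display_name",
        "status": {"to": "status", "callback": "lower", "params": {}},
    }
}

CALLBACKS = {
    "lower": lambda value, **params: value.lower(),
}

def transform_row(table, row):
    """Apply the datamap for `table` to a single source row."""
    out = {}
    for src_col, dest in DATAMAP[table].items():
        if isinstance(dest, dict):  # column needs processing first
            out[dest["to"]] = CALLBACKS[dest["callback"]](row[src_col], **dest["params"])
        else:  # simple column rename
            out[dest] = row[src_col]
    return out

new_row = transform_row("users", {"user_name": "Sean", "status": "ACTIVE"})
```

The calling script then just loops over the source rows, feeds each through transform_row() and inserts the result.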

After the build is done, test and verify the data. Update and modify your code as necessary and test again. Continue iterating until you are secure in the outcome. If you’re moving a large quantity of data it can be difficult to verify all the data, but be certain to do spot checks on all the tables you’re inserting data into. 
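One cheap verification pass is to compare per-table row counts between source and destination before digging into individual rows. In this sketch count_rows() just counts in-memory lists; in practice it would run a SELECT COUNT(*) against each database.

```python
# Flag any table whose destination row count doesn't match the source.
def count_rows(db, table):
    return len(db[table])

source = {"users": [{"id": 1}, {"id": 2}, {"id": 3}], "posts": [{"id": 1}]}
dest   = {"users": [{"id": 11}, {"id": 12}, {"id": 13}], "posts": []}

mismatches = {
    table: (count_rows(source, table), count_rows(dest, table))
    for table in source
    if count_rows(source, table) != count_rows(dest, table)
}
```

Matching counts don’t prove the data is right, of course, which is why the spot checks on actual rows still matter.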

Running the Tool

Pick a time to run your importer when traffic to your application is low, preferably before you launch. Provided you’ve done all your preparation and testing, running your importer should be stress-free and successful.

After running imports on millions of rows of data I can tell you that proper preparation makes for a successful implementation. And there are few things in this line of work that feel as good as running a successful data transfer process. Yes, that sounds suuuuper nerdy, but it’s true.

Best of luck with your imports and exports.