Yesterday, I got to solve a puzzle that I’ve solved more than a few times before. “How do we get a moderate amount of data from here to there in reasonably short order?”
The specifics in this sort of puzzle change all the time. What constitutes a moderate amount of data? What sort of distance matters between “here” and “there”? Is it miles? Vendors? Representation?
Other factors, like the timeline, remain constant. “Reasonably short order” usually means “by tomorrow morning,” give or take. It’s a short but achievable timeline, one where smart decisions that reduce real-world latencies (start immediately!) matter as much as, or more than, decisions about transfer protocols or software tools that might make more perfect use of resources.
It’s a case where the perfect is the enemy of the good, and where having seen puzzles like this before confers an advantage. As is true throughout life, having friends helps too.
Here’s the story:
500GB on a jump drive
Diamond Age Data Science is a consulting company in the business of computational biology and bioinformatics. The founder texted me in the early afternoon: They had agreed to do a particular analysis “before the end of the year.” They planned to use Amazon’s cloud, and the source data was on a 1TB USB disk at LabCentral in Cambridge, MA. The customer had tried a sensible thing – plug the disk into a laptop and use the GUI to drag and drop the files.
The time estimate was FOUR DAYS from now.
In real terms, this would have consumed one of the two weeks remaining before the end-of-year holidays. It would have put the deadline at risk, and it had the potential to interfere with a laptop-free vacation.
In addition, the customer wanted to take their laptop home in the evenings even during the data transfer. That meant stopping and restarting, or perhaps lashing together an impromptu workstation.
It was only about 500GB of data. 500 gigabytes is 4,000 gigabits. At one gigabit per second (a relatively standard wired connection in a modern office), this was not less than 66 minutes’ worth of transfer. Even accounting for inefficiencies on the order of a factor of two or three, this shouldn’t have been an overnight problem, much less a week of delay.
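For the record, here’s the back-of-the-envelope version, in decimal units and ignoring protocol overhead entirely:

# 500 GB at 8 bits per byte is 4,000 gigabits.
# At a sustained 1 Gb/sec, that's 4,000 seconds, or roughly 66 minutes.
echo $(( 500 * 8 ))       # => 4000 gigabits to move
echo $(( 500 * 8 / 60 ))  # => 66 minutes at one gigabit per second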
I happened to be about a 10 minute bicycle ride from the lab in question, and this is a game I like to play.
To make it more fun, I decided to up the ante from “tomorrow morning” to “before dinner.”
Could I move the data and still get home in time for dinner with a friend from out of town? I figured that it was reasonably possible, but I couldn’t afford to go in the wrong direction for even an hour.
As an aside, overhead and protocol challenges are a really big deal in large scale data transfers in clinical medicine, finance, satellite, media/entertainment, and so on. For a graduate education on this topic, read up on Michelle Munson’s Aspera. Start with the FASP protocol.
Which way to the cloud
There are a lot of places in Boston and Cambridge with good connectivity. I figured that I could call in a favor with friends at Harvard’s research computing, at the Broad Institute, or even at one of the local pharmaceuticals. Any of these could have worked out fine.
However, all of those groups use fiber optic connections that terminate at the same building: The Markley Data Center at Downtown Crossing. It’s perhaps the single best connected building in New England. Also, the people who run the place are smart and like to play this sort of game.
So I picked up the disk and biked across the river. It took about 20 minutes.
My bicycle was about a 3Gb/sec transfer device – with latencies on the order of 1,200,000 milliseconds.
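Same sort of arithmetic, if you want to check me (a roughly 20 minute ride carrying 500 GB):

# 4,000 gigabits delivered over a 20 minute (1,200 second) ride.
echo $(( 4000 / 1200 ))     # => 3 Gb/sec, give or take (integer math rounds down)
echo $(( 20 * 60 * 1000 ))  # => 1200000 milliseconds of latency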
Linux Servers are Free
There’s an old saying – you never eliminate a bottleneck, you just move it around. Since I wanted to be home in time for dinner, I wanted to think through, rather than discover, those bottlenecks. One of the big potential bottlenecks was the wireless card on my laptop. Most wireless networks don’t deliver a full 1Gb/sec. My Apple laptop, for all that I love it, does not come with a traditional RJ45 network plug. I was going to need a wired connection.
So I texted ahead and asked the folks at Markley if I could borrow a laptop or a workstation to plug the disk into. This was also a hedge – assuming that we could get the data transfer rolling, I wouldn’t be in the same situation as the folks across the river – waiting on a data transfer before unplugging my laptop.
In the same conversation (stopped by the curb on the Longfellow bridge), we decided to create a virtual machine on Markley’s private cloud for use as a staging machine. I’ve been trapped in enough data centers over the years, babysitting some dumb process while my friends ate dinner without me. So I requested a bit of infrastructure. About 90 seconds later, I had an email in my INBOX including some (but not all) of the credentials needed to log into a dedicated internet connected server with a terabyte of attached disk.
The big advantage of the VM was that it would be trivial for me to reach it from home. Also, I could install whatever software I needed on it without inconveniencing the Markley staff or forcing them to re-image the loaner laptop. The in-building network is extremely “clean” with almost zero packet loss and sub-millisecond latencies (better than a bicycle by about 7 orders of magnitude). I figured that it was very likely that we could get the data off of the disk to something in-building in a couple of hours without any drama. The wider internet is not necessarily so clean and consistent.
I was, at this point, assuming that something would go wrong, and that I would be logging in after my guests went home to bump the process along.
Besides, it’s a 90 second task to configure a virtual machine on a private cloud.
So by the time I got to Downtown Crossing, there was a workstation waiting for me, as well as a virtual machine that I could log into. The nice folks there had also found a Thunderbolt to RJ45 adapter, in case I didn’t want to bother with the loaner laptop. On a whim, mostly curious how fast we could make this happen, I plugged the disk into my own laptop and started a simple “scp” of a test file.
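The test was nothing fancier than something along these lines (the hostname and path here are invented; the real ones were specific to the setup):

# Copy one large test file from the USB disk to the staging VM
# to get a feel for the sustained transfer rate.
scp /Volumes/usb-disk/test-file.tar.gz someuser@staging-vm:/data/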
I was getting 3 megabytes per second.
3 megabytes per second is 24 megabits per second. About 2.3% of the number that I had been using for my estimates. That put me right back into the “several days” estimate from before. My friend, sitting next to me, tried the same simple copy from the internal disk on his laptop to the server (connected via an identical connection) and got consistent performance of 80 megabytes per second.
In the spirit of not overthinking it (and getting home in time for dinner!) we tested two things at once by swapping the disk over to his machine and starting the in-house copy to the VM. Why mess with it? It was moving into mid-afternoon now, and we wanted this first phase done before we were anywhere close to stressed.
I should note that, at that point, it could -totally- have been the disk. In that case, we would have been stuck. Fortunately, the connection screamed to life at 80MB/sec. He set to work starting a few batches of copies to run in parallel, which squeezed another 10MB/sec or so out of the connection. 90 megabytes per second is 720 megabits per second, or 72% of the theoretical maximum of the connection. Our theoretical 66 minute transfer was going to take 90 minutes to get to the VM.
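I didn’t look over his shoulder at the exact commands, but one way to run a few copies in parallel looks roughly like this (paths and hostname invented, as before):

# Kick off one recursive copy per top-level directory, all at once,
# then wait for the whole batch to finish.
for dir in /Volumes/usb-disk/*/ ; do
  scp -r "$dir" someuser@staging-vm:/data/ &
done
wait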
It wasn’t until half an hour later that I realized the problem with my laptop: I still had my VPN configured. After all that bicycling and thinking I was so clever – my laptop was doing exactly what I had set it up to do, and spraying packets all over the country, from Seattle to Georgia, so that whatever coffee shop, airline, or hotel I happened to be connecting from couldn’t snoop on me.
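A thirty-second sanity check would have caught it. Something like this (hostname invented again) tells you immediately whether your packets are staying in the building:

# Sub-millisecond round trips mean the traffic stays on the in-building
# network; tens of milliseconds mean it's taking a detour (through a VPN, say).
ping -c 5 staging-vm.example.com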
After a good laugh, we agreed that I have a really good VPN, all told. Also, we let the copy run; it was already about a third of the way done.
The Last Bit
The final step was almost laughably easy. I wanted the data to wind up on Amazon’s S3. My friend at Diamond Age had configured a user account for me, and had picked a name for the bucket.
It turns out that enough people have solved this problem before that it was mostly copy-pasting examples from the Amazon Command Line FAQ. There was the tiniest bit of Yak Shaving to get Python, PIP, and the Amazon CLI configured on the VM – but after that I ran a command very much like:
aws s3 sync /data s3://my-awesome-bucket
And was rewarded with transfer rates on the order of 66MB/sec.
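For what it’s worth, the yak shaving amounted to roughly this. I’m assuming a Debian-ish image here; the package names will differ on other distributions:

# One-time setup on the staging VM (adjust for your distribution).
sudo apt-get update && sudo apt-get install -y python3-pip
pip3 install awscli   # provides the `aws` command line tool
aws configure         # prompts for access key, secret key, and default region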
I verified that I could get ~60 megabytes per second consistently from the private cloud to the public cloud, and also that we could do both transfers concurrently. The link from the laptop to the VM was still contentedly chugging along at 90MB/sec.
I also verified that I could log into the VM from outside of the data center. I would have felt pretty silly to have to come back the next day to do any necessary cleanup.
My friend and I caught up a bit about our plans for the holidays. Then I biked home as the light faded.
By the time I got home, the initial sync command was done, and all the data from the disk was present on the VM. I ran the sync again, with options to do a detailed check for any differences that might have crept in during the two-stage copy.
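In its simplest form, that second pass looks something like the following: the sync re-uploads only what is missing or looks different, and a recursive listing gives totals to cross-check against the source disk.

# Re-run the sync; by default it compares file size and modification time
# and only uploads anything that is missing or differs.
aws s3 sync /data s3://my-awesome-bucket

# Cross-check: total object count and bytes now in the bucket.
aws s3 ls s3://my-awesome-bucket --recursive --summarize | tail -n 2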
It finished up while I was writing this post, after dinner.