Two clouds and a bicycle

Yesterday, I got to solve a puzzle that I’ve solved more than a few times before. “How do we get a moderate amount of data from here to there in reasonably short order?”

The specifics in this sort of puzzle change all the time. What constitutes a moderate amount of data? What sort of distance matters between “here” and “there”? Is it miles? Vendors? Representation?

Other factors, like the timeline, remain constant. “Reasonably short order” usually means “by tomorrow morning,” give or take. It’s a short but achievable timeline, where smart decisions to reduce real-world latencies (start immediately!) matter as much as, or more than, decisions about transfer protocols or software tools that might make more perfect use of resources.

It’s a case where the perfect is the enemy of the good, and where having seen puzzles like this before confers an advantage. As is true throughout life, having friends helps too.

Here’s the story:

500GB on a jump drive

Diamond Age Data Science is a consulting company in the business of computational biology and bioinformatics. The founder texted me in the early afternoon: They had agreed to do a particular analysis “before the end of the year.” They planned to use Amazon’s cloud, and the source data was on a 1TB USB disk at LabCentral in Cambridge, MA. The customer had tried a sensible thing – plug the disk into a laptop and use the GUI to drag and drop the files.

The time estimate was FOUR DAYS from now.

In real terms, this would have consumed one of the two weeks remaining before the end-of-year holidays. It would have put the deadline at risk, and had the potential to interfere with a laptop-free vacation.

In addition, the customer wanted to take their laptop home in the evenings even during the data transfer. That meant stopping and restarting, or perhaps lashing together an impromptu workstation.

It was only about 500GB of data. 500 gigabytes is 4,000 gigabits. At one gigabit per second (a relatively standard wired connection in a modern office), this was at least 66 minutes’ worth of transfer. Even accounting for inefficiencies on the order of a factor of two or three, this shouldn’t have been an overnight problem, much less a week of delay.
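That back-of-the-envelope arithmetic is worth writing down. A quick sketch, using the figures from above (the 1 Gb/sec link speed is the working assumption, not a measurement):

```python
# Back-of-the-envelope estimate: 500 GB over an assumed 1 Gb/sec link.
data_gb = 500                     # gigabytes on the disk
data_gbit = data_gb * 8           # 4,000 gigabits
link_gbps = 1.0                   # standard wired office connection (assumed)

seconds = data_gbit / link_gbps   # 4,000 seconds at the ideal rate
minutes = seconds / 60            # ~66.7 minutes

print(f"ideal transfer time: {minutes:.1f} minutes")
# Even a factor of two or three of real-world inefficiency
# leaves this an afternoon-sized job, not a multi-day one.
```

Multiply by three for overhead and you still land around three and a half hours, which is what makes a four-day estimate so jarring.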

I happened to be about a 10 minute bicycle ride from the lab in question, and this is a game I like to play.

To make it more fun, I decided to up the ante from “tomorrow morning” to “before dinner.”

Could I move the data and still get home in time for dinner with a friend from out of town? I figured that it was reasonably possible, but I couldn’t afford to go in the wrong direction for even an hour.

As an aside, overhead and protocol challenges are a really big deal in large scale data transfers in clinical medicine, finance, satellite, media/entertainment, and so on. For a graduate education on this topic, read up on Michelle Munson’s Aspera. Start with the FASP protocol.

Which way to the cloud

There are a lot of places in Boston and Cambridge with good connectivity. I figured that I could call in a favor with friends at Harvard’s research computing, at the Broad Institute, or even at one of the local pharmaceutical companies. Any of these could have worked out fine.

However, all of those groups use fiber optic connections that terminate at the same building: The Markley Data Center at Downtown Crossing. It’s perhaps the single best connected building in New England. Also, the people who run the place are smart and like to play this sort of game.

So I picked up the disk and biked across the river. It took about 20 minutes.

My bicycle was about a 3Gb/sec transfer device – with latencies on the order of 1,200,000 milliseconds.
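Those bicycle numbers come from the same kind of arithmetic: 500 GB carried in a roughly 20-minute ride.

```python
# Sneakernet bandwidth: 500 GB carried across the river in ~20 minutes.
data_gbit = 500 * 8                  # gigabits on the disk
ride_seconds = 20 * 60               # 1,200 seconds of pedaling

bandwidth_gbps = data_gbit / ride_seconds   # ~3.3 Gb/sec
latency_ms = ride_seconds * 1000            # 1,200,000 ms to first byte

print(f"{bandwidth_gbps:.1f} Gb/sec at {latency_ms:,} ms latency")
```

Triple the throughput of the office network, at roughly a billion times the latency of an in-building hop.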

Linux Servers are Free

There’s an old saying: you never eliminate a bottleneck, you just move it around. Since I wanted to be home in time for dinner, I wanted to think through, rather than discover, those bottlenecks. One of the big potential bottlenecks was the wireless card on my laptop. Most wireless networks don’t deliver a full 1Gb/sec. My Apple laptop, for all that I love it, does not come with a traditional RJ45 network plug. I was going to need a wired connection.

So I texted ahead and asked the folks at Markley if I could borrow a laptop or a workstation to plug the disk into. This was also a hedge – assuming that we could get the data transfer rolling, I wouldn’t be in the same situation as the folks across the river – waiting on a data transfer before unplugging my laptop.

In the same conversation (stopped by the curb on the Longfellow bridge), we decided to create a virtual machine on Markley’s private cloud for use as a staging machine. I’ve been trapped in enough data centers over the years, babysitting some dumb process while my friends ate dinner without me. So I requested a bit of infrastructure. About 90 seconds later, I had an email in my INBOX including some (but not all) of the credentials needed to log into a dedicated internet connected server with a terabyte of attached disk.

The big advantage of the VM was that it would be trivial for me to reach it from home. Also, I could install whatever software I needed on it without inconveniencing the Markley staff or forcing them to re-image the loaner laptop. The in-building network is extremely “clean” with almost zero packet loss and sub-millisecond latencies (better than a bicycle by about 7 orders of magnitude). I figured that it was very likely that we could get the data off of the disk to something in-building in a couple of hours without any drama. The wider internet is not necessarily so clean and consistent.

I was, at this point, assuming that something would go wrong, and that I would be logging in after my guests went home to bump the process along.

Besides, it’s a 90 second task to configure a virtual machine on a private cloud.

So by the time I got to Downtown Crossing, there was a workstation waiting for me, as well as a virtual machine that I could log into. The nice folks there had also found a Thunderbolt to RJ45 adapter, in case I didn’t want to bother with the loaner laptop. On a whim, mostly curious how fast we could make this happen, I plugged the disk into my own laptop and started a simple “scp” of a test file.

Too slow!

I was getting 3 megabytes per second.

3 megabytes per second is 24 megabits per second. About 2.3% of the number that I had been using for my estimates. That put me right back into the “several days” estimate from before. My friend, sitting next to me, tried the same simple copy from the internal disk on his laptop to the server (connected via an identical connection) and got consistent performance of 80 megabytes per second.

In the spirit of not overthinking it (and getting home in time for dinner!) we tested two things at once by swapping the disk over to his machine and starting the in-house copy to the VM. Why mess with it? It was moving into mid-afternoon now, and we wanted this first phase done before we were anywhere close to stressed.

I should note that, at that point, it could -totally- have been the disk. In that case, we would have been stuck. Fortunately, the connection screamed to life at 80MB/sec. He set to work starting a few batches of copies to run in parallel, which squeezed another 10MB/sec or so out of the connection. 90 megabytes per second is 720 megabits per second, or 72% of the theoretical maximum of the connection. Our theoretical 66 minute transfer was going to take 90 minutes to get to the VM.
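The parallel-copy trick deserves a sketch. A single copy stream pays per-file overhead serially; a few workers overlap that overhead and keep the link busy. The actual copies that day were shell commands run by hand, so the function and paths below are purely illustrative:

```python
# Sketch: copy a directory tree with a small pool of parallel workers.
# A single stream stalls on per-file overhead; a few workers overlap it,
# which is how a few batches of copies squeezed out the extra ~10 MB/sec.
import shutil
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def copy_tree_parallel(src: Path, dst: Path, workers: int = 4) -> int:
    """Copy every file under src to dst, preserving relative paths."""
    files = [p for p in src.rglob("*") if p.is_file()]

    def copy_one(path: Path) -> None:
        target = dst / path.relative_to(src)
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(path, target)   # copy contents and metadata

    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(copy_one, files))   # consume to surface any errors
    return len(files)
```

Threads (rather than processes) are the right tool here because the work is I/O-bound; the interpreter spends its time waiting on the disk and the network, not on Python bytecode.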

Duh

It wasn’t until half an hour later that I realized the problem with my laptop: I still had my VPN configured. After all that bicycling and thinking I was so clever – my laptop was doing exactly what I had set it up to do, and spraying packets all over the country, from Seattle to Georgia, so that whatever coffee shop, airline, or hotel I happened to be connecting from couldn’t snoop on me.

After a good laugh, we agreed that I have a really good VPN, all told. We also let the copy run; it was already about a third of the way done.

The Last Bit

The final step was almost laughably easy. I wanted the data to wind up on Amazon’s S3. My friend at Diamond Age had configured a user account for me, and had picked a name for the bucket.

It turns out that enough people have solved this problem before that it was mostly copy-pasting examples from the Amazon Command Line FAQ. There was the tiniest bit of yak shaving to get Python, pip, and the Amazon CLI configured on the VM, but after that I ran a command very much like:

aws s3 sync /data s3://my-awesome-bucket

And was rewarded with transfer rates on the order of 66MB/sec.

I verified that I could get ~60 megabytes per second consistently from the private cloud to the public cloud, and also that we could do both transfers concurrently. The link from the laptop to the VM was still contentedly chugging along at 90MB/sec.

I also verified that I could log into the VM from outside of the data center. I would have felt pretty silly to have to come back the next day to do any necessary cleanup.

My friend and I caught up a bit about our plans for the holidays. Then I biked home as the light faded.

By the time I got home, the initial sync command was done, and all the data from the disk was present on the VM. I ran the sync again, with options to do a detailed check for any differences that might have crept in during the two-stage copy.

It finished up while I was writing this post, after dinner.

DeepVariant

Earlier this week, Google published DeepVariant, a machine learning (ML) based tool for genomics. The software is now available on the DNAnexus platform.

This is kind of a big deal, and also kind of not a big deal.

Does it matter?

It’s a big deal in the same way that ML systems exceeding the performance of radiologists on diagnostic classification of images is a big deal. Sure, it’s a little creepy and intimidating when a computer program exceeds a respected and trained professional at one of their tasks. On the other hand, it would take a spectacularly naive and arrogant person to claim that a radiologist’s only job is image classification.

It’s not a big deal because there is still so much domain expertise required to derive scientifically meaningful results from genomic data, and because these methods are still changing all the time.

The DeepVariant team targeted one of the points in the genomic analysis workflow where scientists have historically used eyeballs and intuition to identify subtle patterns in the data. Prior variant callers were built atop that intuition, coding it into complex algorithms. That’s why there was a well-characterized image format (the pileup) already available as a starting point for the project: scientists still want to look at the output of their callers to see whether it aligns with intuition.

That’s why there was a contest for the team to win. Because we’re still figuring this stuff out.

It was a good place to start, and the system performed much as we might expect.

Much to Learn

I saw a preview of this technology at the Broad Institute, sometime in mid to late 2016. We were all really impressed. I remember that someone asked exactly the right question: “Can it -discover- a new sort of biological artifact or feature? One that we haven’t seen before?”

The team was unambiguous: Of course it can’t. Until the patterns are present in the training data, there’s nothing there to learn. Further, this particular approach will never suggest that, maybe, we’re looking at the problem sideways.

Put another way: There is a lot of genomic biology still to be learned.

Every year that I’ve been in and around this field, there has been at least one discovery that has up-ended major pieces of institutional knowledge and dogma. Formerly safe assumptions get melted down, combined with new insights, and formed into superior alloys all the time.

The more subtle challenge

There is a more subtle challenge in this particular case: We’re dealing with measurements rather than facts here. The process of DNA sequencing is complex and subtle, with biases and omissions and room for improvement throughout. The way that this particular test was framed up assumes that there is one unambiguous correct answer to the question of variation, and that we already know that answer.

A genomic biologist – or scientist of any stripe – has to hold two truths in their head at the same time: They must gather data to answer questions, and they must also accept that the data may suggest refinements to the question itself. Those refinements, the ones that call existing knowledge into question, are where the real innovation happens.

Given enough data, machine learning now excels at answering well formed questions. The task of questioning our assumptions and changing the question itself remains much more subtle.

The take home

The short version is that computers are here, right now, to take away any part of any job that involves memorizing a large corpus of data and then identifying new examples of old categories based on patterns in that data. This is just as true for eyeballing pileup images as it is for reading ZIP codes or license plates.

Machine learning is also here for any part of your job in which you merely turn the crank on a bunch of rules and formulas. This has already impacted a bunch of different jobs: Tax preparation, law, real estate, and travel planning have all undergone radical changes in the last decade.

One final thought: This is also a big deal because while it takes massive computation to create a recognizer like DeepVariant, it is trivial to use that recognizer on any particular input. Variant calling in the old model takes up a lot of CPU power – which can now be turned (hopefully) to more subtle questions.