Author: cdwan

Manufacturing improvements apply to HPC

The Strategy Board

My former colleagues at the Broad Institute recently published a marvelous case study. They describe, in a delightfully brisk and jargon-free way, some of the process improvements they used to radically increase the productivity of the genome sequencing pipeline.

This post is about bringing the benefits of their thinking to our high performance computing (HPC) systems.

The fundamental change was to modify the pipeline of work so that instead of each stage “pushing” to the next, stations would “pull” work when they were ready to receive it. This should be familiar to folks who have experience with Kanban. It also overlaps with both Lean and Agile management techniques. My favorite part of the paper is that they applied similar techniques to knowledge work – with similar gains.

The spare text of the manuscript really doesn’t do justice to what we called the “strategy board meeting.” By the time I started attending in 2014 it was a massive thing, with fifty to a hundred people gathering every Wednesday morning. It was standing room only in front of a huge floor-to-ceiling whiteboard covered with brightly colored tape, dry erase writing, and post-it notes. Many of the post-it notes had smaller stickers stuck on them!

Somehow, in an hour or less every week, we would manage to touch on every part of the operation – from blockers in the production pipeline through to experimental R&D.

My favorite part was that it was a living experiment. Some weeks we would arrive to find that the leadership team had completely re-jiggered some part of the board – or the entire thing. They would explain what they were trying to do and how they hoped we would use it, and then we would all give it a try together.

I really can’t explain it better than the paper itself does. It’s 100% worth the read.

The computational analysis pipeline

When I started attending those strategy board meetings in 2014, I was responsible for research computing. This included, among other things, the HPC systems that we used to turn the raw “reads” of DNA sequence into finished data products. This was prior to Broad’s shift to Google’s Cloud Platform, so all of this happened on a large but finite number of computers at a data center in downtown Boston.

At that time, “pull” had not really made its way into the computational side of the house. Once the sequencers finished writing their output files to disk, a series of automatic processes would submit jobs to the compute cluster. It was a classic “push,” with the potential for a nearly infinite queue of Work In Progress. Classical thinking is that a healthy queue is a good thing in HPC. It gives the scheduler lots of jobs to choose from, which means that you can keep utilization high.

Unfortunately, it can backfire.

One of the little approximations that we make with HPC schedulers is to give extra priority to jobs that have been waiting a long time to run. On this system, we gave one point of priority (a totally arbitrary number) for every hour that a job had been waiting. On lightly loaded systems, this smooths out small weirdnesses and prevents jobs from “starving.”

In this case, it blew up pretty badly.

At the time, there were three major steps in the genome analysis pipeline: Base calling, alignment, and variant calling.

In the summer of 2015, we accumulated enough jobs in the middle stage of the pipeline (alignment) that some jobs were waiting a really long time to run. This meant that they amassed massive amounts of extra priority in the scheduler. This extra priority was enough to put them in front of all of the jobs from the final stage of the pipeline.

We had enough work in the middle of the pipeline that the final stage ran only occasionally, if at all.
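To make that failure mode concrete, here is a minimal sketch of wait-time priority in Python. The one-point-per-hour aging matches what we ran; the base priorities and wait times are made up for illustration and are not our actual scheduler settings.

# Toy model of "aging" priority: one extra point per hour of waiting.
# The base priorities and wait times below are illustrative only.

def effective_priority(base_priority, hours_waiting, points_per_hour=1):
    return base_priority + points_per_hour * hours_waiting

# A freshly submitted variant-calling (final stage) job.
final_stage = effective_priority(base_priority=20, hours_waiting=2)

# An alignment (middle stage) job stuck in the backlog for a week.
middle_stage = effective_priority(base_priority=20, hours_waiting=7 * 24)

print(final_stage)   # 22
print(middle_stage)  # 188 -- the backlog jumps ahead of the final stage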

Unfortunately, it didn’t all tip over and catch fire at once. The pipeline was in a condition from which it was not going to recover without significant intervention, but it would still emit a sample from time to time.

As the paper describes, we were able to expedite particular critical samples – but that only made things worse. Not only did it increase the wait for the long-suffering jobs in the middle of the pipeline, but it also distracted the team with urgent but ultimately disruptive and non-strategic work.

Transparency

One critical realization was that in order for things to work, the HPC team needed to understand the genomic production pipeline. From a system administration perspective, we had high utilization on the system, jobs were finishing, and so on. It was all too easy to push off complaints about slow turnaround time on samples as just more unreasonable demands from an insatiable community of power-users.

Once we all got in front of the same board and saw ourselves as part of the same large production pipeline, things started to click.

A bitter pill

Once we knew what was going on, it was clear that we had to drain that backlog before things were going to get better. It was a hard decision because it meant that we had to make an explicit choice to deliberately slow input from the sequencers. We also had to choose to rate-limit output from variant calling.

Once we adopted a practice of titrating work into the system only at sustainable levels, we were able to begin to call our shots. We measured performance, made predictions, hit those predictions, fixed problems that had been previously invisible, and added compute and storage resources as appropriate. It took months to finish digging out of that backlog, and I think that we all learned a lot along the way.
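For flavor, here is a minimal sketch of what a pull-style admission loop looks like. The work-in-progress cap, the polling interval, and the three callables are hypothetical stand-ins for whatever a real scheduler and sample-tracking system provide; this is a sketch of the idea, not the code we actually ran.

import time

def pull_loop(count_in_flight, next_ready_sample, submit,
              wip_limit=200, poll_seconds=60):
    """Admit work only when a stage has spare capacity ("pull"),
    instead of dumping every finished sample into the queue ("push")."""
    while True:
        if count_in_flight() < wip_limit:
            sample = next_ready_sample()
            if sample is not None:
                submit(sample)
                continue  # keep filling until we reach the cap
        time.sleep(poll_seconds)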

All of this also gave real energy to Broad’s push to use Google’s cloud for compute and data storage. That has been transformational on a number of fronts, since it turns a hardware constraint into a money constraint. Once we were cloud-based we could choose to buy our way out of a backlog, which is vastly more palatable than telling world-leading scientists to wait months for their data.

Seriously, if your organization is made out of human beings, read their paper. It’s worth your time, even if you’re in HPC.

Surfing the hype curve

I’ve spent most of my career on the uncomfortable edge of technology. This meant that I was often the one who got to deal with gear that was being pushed into production just a little bit too early, just a little bit too fast, and just a little bit too aggressively for everything to go smoothly.

This has left me more than a little bit jaded on marketing hype.

Not too long ago I posted a snarky rejoinder on a LinkedIn thread. I said that I had a startup using something called “on-chain AI,” and that we were going to “disrupt the nutraceutical industry.”

I got direct messages from serious sounding people asking if there was still time to get in early on funding me.

Not long after that, a local tech group put out a call for lightning talk abstracts. I went out on a limb and submitted this:

Quantum AI Machine Learning Blockchains on the IoT Cloud Data Ocean: Turning Hype Into Reality


It's easy to get distracted and confused by the hype that surrounds new computing and data storage technologies. This talk will offer working definitions and brutally practical assessments of the maturity of all of the buzzwords in the title.

Somewhat to my horror, they accepted it.

Here are the slides. I would love to hear your thoughts.

Bio-IT World

We’re back around to one of my favorite events of the Boston biotech year, The Bio-IT World Expo.

This conference has been host to a bunch of critical conversations for me. My favorite example happened in 2004. That was the year that the founders of BioTeam and I stepped away from the sessions and the exhibit floor, sat on benches in the stairwells of the Hynes Convention Center, and worked out the details of how I would become their very first employee.

I don’t think that any of us could have predicted that, 14 years later, we would be hosting a two hour panel in the main auditorium to close out and wrap up the conference. Last year’s version was tremendous fun. I’m super excited to get to moderate it again.

We’ve made a few adjustments this year to make the session even more interactive and fast moving. At the same time, we’re keeping the throwable microphone, dynamic and emerging topic list, and the online question submission / topic tracking system.

The panelists brainstormed up an incredible list of topics:

  • Team culture, keeping it healthy
  • Engineering for nation-state scale projects
  • Identity management in federated environments
  • The changing face of the end-user scientific computing environment, specifically notebook style analysis environments
  • Rapid prototyping of data-driven products
  • Diversity – how, specifically, do we intend to empower and amplify emerging voices in our community?
  • What does it take to “validate” a process that was born on a research HPC environment?
  • The maturation of cloud-native, serverless architectures and its uncomfortable collision with current information security and governance processes
  • Data lakes, warehouses, marts, ecosystems, commons, biospheres, bases, gravity, movement, and so on and on
  • Notes from the field as machine learning and AI settle in as mature and productive tools in the kit
  • Emerging technologies like blockchain and how to separate the hype from the reality
  • … and many more …

If you have questions, topics, opinions, or suggestions – please write me or any of the panelists a note.

I’m looking forward to seeing many of you there.

GP-Write

Over the past year, I’ve had the privilege to serve as the chair of a working group (computing) for the GP-Write project. I’m spending today at GP-Write’s annual meeting in Boston.

GP-Write is a highly international, rapidly evolving collaboration with a goal of advancing the technology and the ethical / legal framework necessary for forward engineering of complex genomes. I’m particularly proud of the fact that the very first working group to present is “Ethical, Legal, Social Implications.” It’s nice, for once, to see the question of what we should do discussed prior to all the excitement about what we can do.

My (brief) slides are below.

Data driven health decisions

I just had a personal experience with how timely, personal measurements can drive better health and lifestyle decisions.

Unfortunately, it wasn’t related to any of the times that I’ve been genotyped, nor was it in the context of care by any physician. In fact, I had to cross state lines in order to get it done.

More on that later.

The punch line, for the curious, is that I have elevated levels of mercury in my system, and I should probably eat a bit lower on the food chain when I order sushi.

Genomics fanboy

I’ve been a genomics fanboy for years. I enrolled in 23andMe when it first came out. I did the exome add-on that they offered briefly in 2012. I signed up with the Personal Genome Project around that time, and one of the exomes you can download from their site is mine. I drove an hour and spat in a tube for the Coriell Personalized Medicine Collaborative.

Coriell has been the most satisfying for me personally, since they occasionally email a PDF of a manuscript that is based on analysis using data derived from my (and many others’) saliva. For me, at least, getting to read science papers and think, “I helped!” is much more motivating than cryptocurrency-based micropayments.

While it’s all been fun and interesting, I haven’t learned very much that was terribly actionable. Without putting too fine a point on it, I have basically re-verified that I don’t have any of the major genomic disorders that would have already shown up by middle age. My standard line describing what I learned is that I’m most likely male, almost certainly of northern European descent, with likely brown hair, likely brown eyes, etc.

A question of focus

One way this shows up for me is that I don’t really know where to focus my health and lifestyle efforts. Sight unseen, one might tell a person like me that I should work out a little more, mix it up with cardio and weight-bearing exercise, eat a mostly vegetarian diet, not smoke, drink in moderation if at all, maintain a regular sleep schedule, use sunblock, floss, not sit too long at work, meditate, never read re-tweets, practice test driven development, etc, etc, etc.

None of it appeals, more or less because I know that all this advice is generic. That is: it doesn’t really apply to me. I’m pretty healthy, so who cares, right?

On the opposite side, I’ve written before about my frustrations in convincing my physicians to screen me for colorectal cancer. I have a family history on both sides, genetic markers, and a medical history that all point in the same direction: Elevated risk. The current state of clinical practice is that men don’t need screening before age 50. I’ve been getting screened since my late 20’s, and I persist in thinking that it’s a really good idea. This is one of those cancers that is easily treatable with early detection and lethal without it.

So there we have it: Advice is either so generic that I ignore it, or else when I do have actionable information it’s a challenge to convince my physician to act on it.

Personalized bloodwork

Enter Arivale. They are a relatively recent addition to the direct-to-consumer health and lifestyle offerings that are cropping up this year. I heard about them through professional connections (thanks Dave!), and I’ve been excitedly waiting for them to offer services in Massachusetts.

The Arivale process involves a battery of bloodwork, genetic testing, and a gut microbiome analysis (a novel experience if you haven’t provided a laboratory with a stool sample before). They combine this with coaching from people trained in nutrition, genetic counseling, and behavioral modification.

Because of the niceties of paying for lab work, I had to leave Massachusetts in order to reach a lab that could actually accept my money to draw the blood. Bright and early on a morning in the middle of February, I made a pre-breakfast, pre-coffee commute into New Hampshire to be stuck with needles.

Let me pause and say that again: Our health care system is so screwed up that I had to cross state lines to get bog-standard bloodwork done, entirely because I was paying out of pocket for it.

I also filled out a battery of health and family history questionnaires, as well as some about personality and lifestyle.

Show me the data

A couple of weeks later, I got an email and logged into a slick web dashboard. I went ahead and did the integration with my Fitbit account. I disabled GPS location sharing but enabled the rest. Let’s hear it for granular access control. Because Fitbit connects to my wireless scale, my Arivale coach was suddenly able to access five years of weight data on top of the four years of info on my pulse, sleep, and walking habits that my Fitbit devices have accumulated.

Let me pause and say that again: I logged into a slick web dashboard and integrated years worth of data about myself in the context of a new battery of lab tests. At no point did I have to write down my previous physician’s FAX number on a piece of paper.

It felt normal and ordinary, because I’m used to these integrations everywhere except health care. I do this sort of thing with my bank, my utilities, my news feed, and all sorts of other places.

That is a different rant, but come on!

Ahem.

Retrograde

Anyway, I logged in and saw (among other things), this:

It honestly gave me pause. I’m pretty robustly healthy. I don’t expect to see any of my biological metrics “in the red,” but there it was.

So I did a quick Google search, top hit, I feel lucky:

A bit of refinement:

Which led me to look at my last few Grubhub orders.

Yeah, every time I order, I bolt on that mackerel. That’s for me. That’s my treat. It’s worth noting that February 15 was the night before I made that hungry, grouchy drive. I know that mercury accumulates in tissue and lingers there over time – your mileage may vary – but it’s a pretty clear signal in my book.

And it showed up in my lab work.

Fallout

So there you have it. All of a sudden, I’ve picked something actionable to do for my health – out of the incredible variety of good advice at my fingertips. Because, well:

Converged IT and the Cloud

I promised that I would post a summary from our closing panel at the Converged IT and the Cloud thread at the Molecular Medicine Tri-Conference.

Unfortunately, I was having so much fun in the session itself that I didn’t take any notes at all. Please forgive errors and omissions. My slides are here, but they’re the very least part of the conversation.

I opened up the session with the question that my friend Carolyn posted in the comments of the last post: “What are the biggest barriers to immunotherapy becoming translational (FDA, funding limits, enrollees in clinical trials)? How can patients best support future immunotherapy developments?”.

It sobered the audience considerably, especially when I pointed out that her interest is as a current patient of the system that we all acknowledge has tons of room for improvement.

My point in starting with that question was to move the conversation up a level from IT people talking about IT stuff – and to provide both motivation and urgency. It is very unlikely that a session on “converged IT and the cloud,” would be able to answer Carolyn’s question. That said, we would be remiss to sit around talking about network speeds and feeds, regulatory frameworks, costs per gigabyte, and other technical details without ever engaging with the high level “why” that drives our industry.

Each of the four panelists prepared a brief summary on a specific topic:

Jonathan Sheffi (@sheffi) is the Product Manager for Genomics and Life Sciences within Google Cloud. He spoke about the convergence that he sees in data structures and standards as customers bring different data types like health information, outcomes data, and so on to the “same” cloud. This was pretty exciting to me – since it is the infrastructure groundwork that will support some of the things we’ve been saying about collaboration and integration in the cloud.

Aaron Gardner is with Bioteam, and shared an absolute whirlwind review of machine learning and AI for our field. The coolest part, to me, was the idea of AI/ML as a de-noising tool. The hope is that this will allow us to take unwieldy volumes of data and reduce them to only the necessary level of complexity for a certain task. It took me back to a dimly remembered time when I would talk about “Shannon Information Content” and similar concepts.

I first heard Saira Kazmi speak at the 2017 Bio-IT World, when she was still with the Jackson Laboratory. She had earned a reputation as Jax’s “queen of metadata.” She combined a handful of deceptively simple techniques with an impressively diplomatic tenacity to create a sort of ad-hoc data lake – without ever pausing to go through the most painful parts of the data lake process. Instead, they chose to archive first, scrape file headers into JSON and stuff them into a NoSQL database, and (my favorite) store checksums of large primary data files in a database to identify duplicates and support provenance tracking.
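As a rough illustration of that last trick, here is a minimal sketch of checksumming primary data files and emitting a small JSON metadata record per file. The field names are my own invention, not Jax’s actual schema.

import hashlib, json, os

def checksum(path, chunk_size=1 << 20):
    """SHA-256 of a file, read in chunks so large files don't exhaust memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def metadata_record(path):
    """A minimal JSON document per file; repeated checksums flag duplicate data."""
    return json.dumps({
        "path": os.path.abspath(path),
        "bytes": os.path.getsize(path),
        "sha256": checksum(path),
    })

# Records like these can be stuffed into a NoSQL store and queried for
# duplicate checksums, or joined against downstream analysis results
# to support provenance tracking.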

Finally, we had Annerose Berndt (@AnneroseBerndt), who had just finished standing up a genome sequencing center to serve the UPMC hospitals. I asked her to hold forth a bit on security, compliance, quality systems, and other absolutely necessary bits of process discipline.

We shared a wide-ranging and illuminating conversation building on these topics. It was a blast.

As I said from the stage: I really cannot believe that it’s somehow part of my job to have conversations like this, with people of this caliber. How cool!

Molecular Medicine Tri-Conference

I’m in San Francisco this week, attending the Molecular Medicine Tri-Conference. I’m specifically focused on a new track called Converged IT and the Cloud. I’m paying particular attention and taking notes – both because it’s very interesting and exciting, and also because at the end of three days I get to moderate a panel discussion. We’ve set ourselves the challenge of accepting any and all conversation topics and questions – which, even given the talks so far, is shaping up to be a very broad landscape of awesome, interesting, challenging, and occasionally scary stuff.

In an attempt to break the tyranny of locality and provide a bit of access – if you have a question or topic for the panel, please send it to me (comment, direct message, email, text message, or whatever). I will commit to summarizing the topics here after the conference.

On a personal note, it’s amazing to reflect on how important this community has been in my life and career. Kevin Davies opened the session. Seeing him brought back memories of the uppity startup magazine he created, called Bio-IT World. That magazine developed a mutually supportive relationship with a scrappy new consulting company called Bioteam. By hook and crook, hard work and happy accident, we’re all still bumping along the road, showing up at the same conferences, and working to improve the world together. Along the way, I’ve seen collegial work relationships turn into deep and lasting friendships.

It’s pretty cool. I’m happy to be here.

The Multi-Protocol Fantasy

There’s a particular class of problem that I’ve grappled with regularly over the past decade. I got to go another round just before the holiday break. I figured that this was as good a time as any to share some thoughts.

Multiple representations

Some file servers are able to present a single filesystem via two (or more) different protocols. The most common use case is to support both Linux and Windows clients from the same underlying data. In the usual setup, Linux systems see a POSIX-compliant filesystem via their old reliable Network File System (NFS) protocol. Windows clients use the Windows-native CIFS/Samba protocol and see a friendly, ordinary NTFS filesystem (sometimes mapped as “The L: drive” or similar). In recent years, vendors and developers have begun to add support for newer protocols like S3 and HDFS.

We use this sort of thing all the time in the life sciences. Lab instruments (and laboratory people) write data to a Windows share. The back end heavy lifting of the high performance computing (HPC) environment is done on Linux servers.

The reason that this seems like a good idea is that it’s kind of a pain to convince Linux systems to mount via CIFS / Samba, and it’s also kind of a pain to do the reverse and mount NFS from Windows. While it is certainly possible, all of us of a certain age will respond with a grouchy “harumph,” whenever the idea comes up. I’ve spent many happy hours navigating the complexity of the “unix services” tab on the active directory master (assuming that a lowly Linux admin is allowed access to the enterprise identity management service) or else googling around (again) to find that one last command to convince Linux to “bind” itself to an LDAP server.

The underlying problem

POSIX and NTFS are just different. As soon as we stray beyond the very simplest use cases (writing from a lab-user account and reading from a pipeline analysis account, for example), we encounter situations where it is literally impossible to give correct semantics on both sides. One filesystem has to present an invalid answer in order to provide correct behavior in the other.

Under POSIX (the Linux / NFS flavor) we use read, write, and execute permissions for a single user (the owner), a single group, and a conceptual “everybody.”

Under Windows, we have “access control lists” (ACLs). Any entity from active directory (either users or groups) can have read or write permissions. The idea of a file being executable is handled in a different way.

It’s straightforward to create permissions under Windows that cannot be directly applied on the Linux side. The simplest example is a file for which exactly two users have read access – but where there is no group comprised of only those two users.
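A quick sketch makes the impossibility concrete. The users and groups here are hypothetical; the point is that POSIX only hands you three read bits (owner, one group, everybody) to work with.

import itertools

# Desired (Windows-style) ACL: exactly these two users may read the file.
desired_readers = {"alice", "bob"}

# Hypothetical users and groups on the Linux side.
all_users = {"alice", "bob", "carol", "dave"}
groups = {
    "lab":      {"alice", "bob", "carol"},
    "pipeline": {"bob", "dave"},
}

def readers(owner, group, owner_r, group_r, other_r):
    """Who can read, given an owner, a group, and the three POSIX read bits."""
    who = set()
    if owner_r:
        who.add(owner)
    if group_r:
        who |= groups[group]
    if other_r:
        who |= all_users
    return who

# Enumerate every owner / group / mode-bit combination.
matches = [
    (owner, group, bits)
    for owner in all_users
    for group in groups
    for bits in itertools.product([True, False], repeat=3)
    if readers(owner, group, *bits) == desired_readers
]

print(matches)  # [] -- nothing grants read to exactly alice and bob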

Note that this is true even if all of the users and groups are already correctly mapped between Windows and Linux.

Note further that as you start creating utility groups to overcome this “bug,” you will rapidly discover that NFS only honors the first 16 groups listed for a user – even if those groups are served up by an NIS or an LDAP service.

There are plenty of other examples. The rat’s nest I found myself in last month involved setting default permissions on files created by Windows such that they would be sensible on the Linux side. Under Linux, each user has a single default group (specified by number in a file called /etc/passwd). That concept, of a default group for each user, simply doesn’t exist under Windows.

Go around

Back in November, I wrote a post titled Go Around, about how, sometimes, when you hit an intractable obstacle it just means you’re looking at a problem from the wrong angle. That’s exactly what’s going on when you spend time yelling at a vendor or deforming your active directory forest to solve these protocol collisions. It’s asking the impossible.

As an engineer or informatics person, it is up to you to find a way to frame problems such that they are actually solvable.

There is no such thing as a free lunch on this one. If you want to use POSIX semantics, you have to accept the limitations of POSIX. The same is true of Windows, S3, HDFS, or any other data representation.

Vendors: Multi-protocol permissions blending will never ever work. It just looks like buggy, flaky behavior on the client side. Stop trying.

Users: Go around.

Addendum

The way that you “go around” on this one is to decide which protocol the majority of usage will come from and engineer for that.

If it’s going to be used mostly from Linux – then don’t set complex permissions under Windows, and tell the users that their fancy ACLs are not going to be honored. If it’s the other way around (mostly Windows use), then create a utility group on the Linux side (just the one) and make sure that all the appropriate users are in that group.

Trust me, it’s much easier than trying to merge fundamentally incompatible semantics.

Two clouds and a bicycle

Yesterday, I got to solve a puzzle that I’ve solved more than a few times before. “How do we get a moderate amount of data from here to there in reasonably short order?”

The specifics in this sort of puzzle change all the time. What constitutes a moderate amount of data? What sort of distance matters between “here,” and “there?” Is it miles? vendors? representation?

Other factors, like the timeline, remain constant. “Reasonably short order,” usually means “by tomorrow morning,” give or take. It’s a short, but achievable timeline where smart decisions to reduce latencies in the real world (start immediately!) matter as much or more than decisions about transfer protocols or software tools that might give a more perfect use of resources.

It’s a case where the perfect is the enemy of the good, and where having seen puzzles like this before confers an advantage. As is true throughout life, having friends helps too.

Here’s the story:

500GB on a jump drive

Diamond Age Data Science is a consulting company in the business of computational biology and bioinformatics. The founder texted me in the early afternoon: They had agreed to do a particular analysis, “before the end of the year.” They planned to use Amazon’s cloud, and the source data was on a 1TB USB disk at LabCentral in Cambridge, MA. The customer had tried a sensible thing – plug the disk into a laptop and use the GUI to drag and drop the files.

The time estimate was FOUR DAYS from now.

In real terms, this would have consumed one of the two weeks remaining before the end of year holidays. This would put the deadline at risk, and had the potential to interfere with a laptop-free vacation.

In addition, the customer wanted to take their laptop home in the evenings even during the data transfer. That meant stopping and restarting, or perhaps lashing together an impromptu workstation.

It was only about 500GB of data. 500 gigabytes is 4,000 gigabits. At one gigabit per second (a relatively standard wired connection in a modern office), this was no less than 66 minutes’ worth of transfer. Even accounting for inefficiencies on the order of a factor of two or three – this shouldn’t have been an overnight problem, much less a week of delay.
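The back-of-the-envelope arithmetic, for anyone who wants to check my numbers:

# 500 GB over a 1 Gb/s link, ignoring protocol overhead.
data_gigabits = 500 * 8            # 500 gigabytes = 4,000 gigabits
link_gbps = 1.0                    # a standard wired office connection
seconds = data_gigabits / link_gbps
print(seconds / 60)                # ~66.7 minutes at full line rate
print(3 * seconds / 60)            # ~200 minutes even at a third of that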

I happened to be about a 10 minute bicycle ride from the lab in question, and this is a game I like to play.

To make it more fun, I decided to up the ante from “tomorrow morning” to “before dinner.”

Could I move the data and still get home in time for dinner with a friend from out of town? I figured that it was reasonably possible, but I couldn’t afford to go in the wrong direction for even an hour.

As an aside, overhead and protocol challenges are a really big deal in large scale data transfers in clinical medicine, finance, satellite, media/entertainment, and so on. For a graduate education on this topic, read up on Michelle Munson’s Aspera. Start with the FASP protocol.

Which way to the cloud

There are a lot of places in Boston and Cambridge with good connectivity. I figured that I could call in a favor with friends at Harvard’s research computing group, at the Broad Institute, or even at one of the local pharmaceutical companies. Any of these could have worked out fine.

However, all of those groups use fiber optic connections that terminate at the same building: The Markley Data Center at Downtown Crossing. It’s perhaps the single best connected building in New England. Also, the people who run the place are smart and like to play this sort of game.

So I picked up the disk and biked across the river. It took about 20 minutes.

My bicycle was about a 3Gb/sec transfer device – with latencies on the order of 1,200,000 milliseconds.

Linux Servers are Free

There’s an old saying – you never eliminate a bottleneck, you just move it around. Since I wanted to be home in time for dinner, I wanted to think through, rather than discover those bottlenecks. One of the big potential bottlenecks was the wireless card on my laptop. Most wireless networks are not full 1Gb/sec. My Apple laptop, for all that I love it, does not come with a traditional RJ45 network plug. I was going to need a wired connection.

So I texted ahead and asked the folks at Markley if I could borrow a laptop or a workstation to plug the disk into. This was also a hedge – assuming that we could get the data transfer rolling, I wouldn’t be in the same situation as the folks across the river – waiting on a data transfer before unplugging my laptop.

In the same conversation (stopped by the curb on the Longfellow bridge), we decided to create a virtual machine on Markley’s private cloud for use as a staging machine. I’ve been trapped in enough data centers over the years, babysitting some dumb process while my friends ate dinner without me. So I requested a bit of infrastructure. About 90 seconds later, I had an email in my INBOX including some (but not all) of the credentials needed to log into a dedicated internet connected server with a terabyte of attached disk.

The big advantage of the VM was that it would be trivial for me to reach it from home. Also, I could install whatever software I needed on it without inconveniencing the Markley staff or forcing them to re-image the loaner laptop. The in-building network is extremely “clean” with almost zero packet loss and sub-millisecond latencies (better than a bicycle by about 7 orders of magnitude). I figured that it was very likely that we could get the data off of the disk to something in-building in a couple of hours without any drama. The wider internet is not necessarily so clean and consistent.

I was, at this point, assuming that something would go wrong, and that I would be logging in after my guests went home to bump the process along.

Besides, it’s a 90 second task to configure a virtual machine on a private cloud.

So by the time I got to Downtown Crossing, there was a workstation waiting for me, as well as a virtual machine that I could log into. The nice folks there had also found a Thunderbolt to RJ45 adapter, in case I didn’t want to bother with the loaner laptop. On a whim, mostly curious how fast we could make this happen, I plugged the disk into my own laptop and started a simple “scp” of a test file.

Too slow!

I was getting 3 megabytes per second.

3 megabytes per second is 24 megabits per second. About 2.3% of the number that I had been using for my estimates. That put me right back into the “several days” estimate from before. My friend, sitting next to me, tried the same simple copy from the internal disk on his laptop to the server (connected via an identical connection) and got consistent performance of 80 megabytes per second.

In the spirit of not overthinking it (and getting home in time for dinner!) we tested two things at once by swapping the disk over to his machine and starting the in-house copy to the VM. Why mess with it? It was moving into mid-afternoon now, and we wanted this first phase done before we were anywhere close to stressed.

I should note that, at that point, it could -totally- have been the disk. In that case, we would have been stuck. Fortunately, the connection screamed to life at 80MB/sec. He set to work starting a few batches of copies to run in parallel, which squeezed another 10MB/sec or so out of the connection. 90 megabytes per second is 720 megabits per second, or 72% of the theoretical maximum of the connection. Our theoretical 66 minute transfer was going to take 90 minutes to get to the VM.
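For the curious, “a few batches of copies in parallel” can be as simple as something shaped like the sketch below. The paths and hostname are made up, and the real thing was a handful of hand-started copies rather than a script, but this is the general idea.

from concurrent.futures import ThreadPoolExecutor
import subprocess

# Hypothetical batches of directories on the USB disk. A handful of
# parallel streams helps hide per-file and per-connection overhead.
batches = ["/Volumes/usbdisk/run1", "/Volumes/usbdisk/run2",
           "/Volumes/usbdisk/run3", "/Volumes/usbdisk/run4"]

def copy(path):
    # scp -r is crude but effective; rsync would make restarts cleaner.
    subprocess.run(["scp", "-r", path, "staging-vm:/data/"], check=True)

with ThreadPoolExecutor(max_workers=4) as pool:
    list(pool.map(copy, batches))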

Duh

It wasn’t until half an hour later that I realized the problem with my laptop: I still had my VPN configured. After all that bicycling and thinking I was so clever – my laptop was doing exactly what I had set it up to do, and spraying packets all over the country, from Seattle to Georgia, so that whatever coffee shop, airline, or hotel I happened to be connecting from couldn’t snoop on me.

After a good laugh, we agreed that I have a really good VPN, all told. Also, we let the copy run; it was already about 1/3 of the way done.

The Last Bit

The final step was almost laughably easy. I wanted the data to wind up on Amazon’s S3. My friend at Diamond Age had configured a user account for me, and had picked a name for the bucket.

It turns out that enough people have solved this problem before that it was mostly copy-pasting examples from the Amazon Command Line FAQ. There was the tiniest bit of Yak Shaving to get Python, PIP, and the Amazon CLI configured on the VM – but after that I ran a command very much like:

aws s3 sync /data s3://my-awesome-bucket

And was rewarded with transfer rates on the order of 66MB/sec.

I verified that I could get ~60 megabytes per second consistently from the private cloud to the public cloud, and also that we could do both transfers concurrently. The link from the laptop to the VM was still contentedly chugging along at 90MB/sec.

I also verified that I could log into the VM from outside of the data center. I would have felt pretty silly to have to come back the next day to do any necessary cleanup.

My friend and I caught up a bit about our plans for the holidays. Then I biked home as the light faded.

By the time I got home, the initial sync command was done, and all the data from the disk was present on the VM. I ran the sync again, with the options to do a detailed check for any differences that might have crept in by copying files in these two stages.

It finished up while I was writing this post, after dinner.

DeepVariant

Earlier this week, Google published DeepVariant, a machine learning (ML) based tool for genomics. The software is now available on the DNAnexus platform.

This is kind of a big deal, and also kind of not a big deal.

Does it matter?

It’s a big deal in the same way that ML systems exceeding the performance of radiologists on diagnostic classification of images is a big deal. Sure, it’s a little creepy and intimidating when a computer program exceeds a respected and trained professional at one of their tasks. On the other hand, it would take a spectacularly naive and arrogant person to claim that a radiologist’s only job is image classification.

It’s not a big deal because there is still so much domain expertise required to derive scientifically meaningful results from genomic data, and because these methods are still changing all the time.

The DeepVariant team took one of the points in the genomic analysis workflow where scientists have historically used eyeballs and intuition to identify subtle patterns in the data. Prior variant callers were built atop that intuition, coding it into complex algorithms. That’s why there was a well characterized image format (Pileup) already available as a starting point for the project – scientists still want to look at the results of their callers to see if the results align with intuition.

That’s why there was a contest for the team to win. Because we’re still figuring this stuff out.

It was a good place to start, and the system performed much as we might expect.

Much to Learn

I saw a preview of this technology at the Broad Institute, sometime in mid to late 2016. We were all really impressed. I remember that someone asked exactly the right question: “Can it -discover- a new sort of biological artifact or feature? One that we haven’t seen before?”

The team was unambiguous: Of course it can’t. Until the patterns are present in the training data, there’s nothing there to learn. Further, this particular approach will never suggest that, maybe, we’re looking at the problem sideways.

Put another way: There is a lot of genomic biology still to be learned.

Every year that I’ve been in and around this field, there has been at least one discovery that has up-ended major pieces of institutional knowledge and dogma. Formerly safe assumptions get melted down, combined with new insights, and formed into superior alloys all the time.

The more subtle challenge

There is a more subtle challenge in this particular case: We’re dealing with measurements rather than facts here. The process of DNA sequencing is complex and subtle, with biases and omissions and room for improvement throughout. The way that this particular test was framed up assumes that there is one unambiguous correct answer to the question of variation, and that we already know that answer.

A genomic biologist – or scientist of any stripe – has to hold two truths in their head at the same time: They must gather data to answer questions, and they must also accept that the data may suggest refinements to the question itself. Those refinements to the question, the ones that call existing knowledge into question – that’s where the real innovation happens.

Given enough data, machine learning now excels at answering well formed questions. The task of questioning our assumptions and changing the question itself remains much more subtle.

The take home

The short version is that computers are here, right now, to take away any part of any job that involves memorizing a large corpus of data and then identifying new examples of old categories based on patterns in that data. This is just as true for eyeballing pileup images as it is for reading ZIP codes or license plates.

Machine learning is also here for any part of your job in which you merely turn the crank on a bunch of rules and formulas. This has already impacted a bunch of different jobs: Tax preparation, law, real estate, and travel planning have all undergone radical changes in the last decade.

One final thought: This is also a big deal because while it takes massive computation to create a recognizer like DeepVariant, it is trivial to use that recognizer on any particular input. Variant calling in the old model takes up a lot of CPU power – which can now be turned (hopefully) to more subtle questions.