Manufacturing improvements apply to HPC

The Strategy Board

My former colleagues at the Broad Institute recently published a marvelous case study. They describe, in a delightfully brisk and jargon-free way, some of the process improvements they used to radically increase the productivity of the genome sequencing pipeline.

This post is about bringing the benefits of their thinking to our high performance computing (HPC) systems.

The fundamental change was to modify the pipeline of work so that instead of each stage “pushing” to the next, stations would “pull” work when they were ready to receive it. This should be familiar to folks who have experience with Kanban. It also overlaps with both Lean and Agile management techniques. My favorite part of the paper is that they applied similar techniques to knowledge work – with similar gains.

The spare text of the manuscript really doesn’t do justice to what we called the “strategy board meeting.” By the time I started attending in 2014 it was a massive thing, with fifty to a hundred people gathering every Wednesday morning. It was standing room only in front of a huge floor-to-ceiling whiteboard covered with brightly colored tape, dry erase writing, and post-it notes. Many of the post-it notes had smaller stickers stuck on them!

Somehow, in an hour or less every week, we would manage to touch on every part of the operation – from blockers in the production pipeline through to experimental R&D.

My favorite part was that it was a living experiment. Some weeks we would arrive to find that the leadership team had completely re-jiggered some part of the board – or the entire thing. They would explain what they were trying to do and how they hoped we would use it, and then we would all give it a try together.

I really can’t explain it better than the paper itself does. It’s 100% worth the read.

The computational analysis pipeline

When I started attending those strategy board meetings in 2014, I was responsible for research computing. This included, among other things, the HPC systems that we used to turn the raw “reads” of DNA sequence into finished data products. This was prior to Broad’s shift to Google’s Cloud Platform, so all of this happened on a large but finite number of computers at a data center in downtown Boston.

At that time, “pull” had not really made its way into the computational side of the house. Once the sequencers finished writing their output files to disk, a series of automatic processes would submit jobs onto the compute cluster. It was a classic “push,” with the potential for a nearly infinite queue of Work In Progress. Classical thinking is that a healthy queue is a good thing in HPC. It gives the scheduler lots of jobs to choose from, which means that you can keep utilization high.

Unfortunately, it can backfire.

One of the little approximations that we make with HPC schedulers is to give extra priority to jobs that have been waiting a long time to run. On this system, we gave one point of priority (a totally arbitrary number) for every hour that a job had been waiting. On lightly loaded systems, this smooths out small weirdnesses and prevents jobs from “starving.”

In this case, it blew up pretty badly.

At the time, there were three major steps in the genome analysis pipeline: base calling, alignment, and variant calling.

In the summer of 2015, we accumulated enough jobs in the middle stage of the pipeline (alignment) that some jobs were waiting a really long time to run. This meant that they built up a large amount of extra priority in the scheduler. This extra priority was enough to put them in front of all of the jobs from the final stage of the pipeline.

We had enough work in the middle of the pipeline that the final stage ran occasionally, if at all.
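
To make the failure mode concrete, here is a minimal sketch of priority aging. It is not the configuration we actually ran. The only number taken from the story is one point per hour of waiting. The job names, the base priorities, and the wait times are all made up for illustration, and a real scheduler weighs many more factors than this.

```python
from dataclasses import dataclass

AGE_POINTS_PER_HOUR = 1.0  # from the post: one point per hour of waiting


@dataclass
class Job:
    name: str             # hypothetical job name
    stage: str            # pipeline stage the job belongs to
    base_priority: float  # hypothetical per-stage priority
    hours_waiting: float  # time spent in the queue so far


def effective_priority(job: Job) -> float:
    """Base priority plus the aging bonus earned while waiting."""
    return job.base_priority + AGE_POINTS_PER_HOUR * job.hours_waiting


def pick_next(queue: list[Job]) -> Job:
    """Toy scheduler: always run the job with the highest effective priority."""
    return max(queue, key=effective_priority)


# Assumed numbers: the final stage starts with a higher base priority,
# but the alignment job has been stuck in the backlog for over a week.
queue = [
    Job("align-0412", "alignment", base_priority=10, hours_waiting=200),
    Job("variants-0007", "variant calling", base_priority=50, hours_waiting=5),
]

next_job = pick_next(queue)
print(next_job.name, effective_priority(next_job))
```

With these assumed numbers the alignment job scores 210 against 55 for the variant calling job, so the toy scheduler picks alignment even though the final stage started with the higher base priority. As long as the alignment backlog keeps aging faster than it drains, the final stage keeps losing.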

Unfortunately, it didn’t all tip over and catch fire at once. The pipeline was in a condition from which it was not going to recover without significant intervention, but it would still emit a sample from time to time.

As the paper describes, we were able to expedite particular critical samples – but that only made things worse. Not only did it increase the wait for the long-suffering jobs in the middle of the pipeline, but it also distracted the team with urgent but ultimately disruptive and non-strategic work.

Transparency

One critical realization was that in order for things to work, the HPC team needed to understand the genomic production pipeline. From a system administration perspective, we had high utilization on the system, jobs were finishing, and so on. It was all too easy to push off complaints about slow turnaround time on samples as just more unreasonable demands from an insatiable community of power-users.

Once we all got in front of the same board and saw ourselves as part of the same large production pipeline, things started to click.

A bitter pill

Once we knew what was going on, it was clear that we had to drain that backlog before things were going to get better. It was a hard decision because it meant that we had to make an explicit choice to deliberately slow input from the sequencers. We also had to choose to rate-limit output from variant calling.

Once we adopted a practice of titrating work into the system only at sustainable levels, we were able to begin to call our shots. We measured performance, made predictions, hit those predictions, fixed problems that had been previously invisible, and added compute and storage resources as appropriate. It took months to finish digging out of that backlog, and I think that we all learned a lot along the way.
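
The idea of titrating work into the system can be sketched as a small admission gate that sits in front of the scheduler. This is an illustration of the "pull" idea, not the tooling we built. The class name `PullGate`, the `wip_limit` parameter, and the sample IDs are all hypothetical.

```python
from collections import deque


class PullGate:
    """Hold new samples outside the cluster; release only up to a WIP limit."""

    def __init__(self, wip_limit: int):
        self.wip_limit = wip_limit  # assumed sustainable number of samples in flight
        self.waiting = deque()      # samples that have arrived but not been submitted
        self.in_flight = set()      # samples currently somewhere in the pipeline

    def arrive(self, sample_id: str) -> None:
        """A sequencer finished a sample; park it instead of submitting it."""
        self.waiting.append(sample_id)

    def release(self) -> list[str]:
        """Pull waiting samples into the pipeline while there is capacity."""
        released = []
        while self.waiting and len(self.in_flight) < self.wip_limit:
            sample_id = self.waiting.popleft()
            self.in_flight.add(sample_id)
            released.append(sample_id)
        return released

    def finish(self, sample_id: str) -> None:
        """A sample left the final stage, which frees one slot."""
        self.in_flight.discard(sample_id)


gate = PullGate(wip_limit=2)
for i in range(5):
    gate.arrive(f"sample-{i}")

first_batch = gate.release()   # fills both slots
second_batch = gate.release()  # nothing released, the pipeline is full
gate.finish("sample-0")
third_batch = gate.release()   # one slot opened up, so one sample is pulled
```

The design choice that matters is that the backlog lives outside the scheduler, in first-in-first-out order, where it cannot accumulate aging priority or crowd out later stages. New work enters only when finished work leaves.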

All of this also gave real energy to Broad’s push to use Google’s cloud for compute and data storage. That has been transformational on a number of fronts, since it turns a hardware constraint into a money constraint. Once we were cloud-based we could choose to buy our way out of a backlog, which is vastly more palatable than telling world-leading scientists to wait months for their data.

Seriously, if your organization is made out of human beings, read their paper. It’s worth your time, even if you’re in HPC.