The ever gathering storm

It’s summertime – season of thunderstorms. Most days are punctuated with ominous clouds and distant thunder. Actual rain, however, is rare. The forecast is consistent – temperatures may spike up to uncomfortably hot in the afternoon, and there are low odds of a thunderstorm. I carry an umbrella all day, and then water my garden by hand.

It reminds me of our industry-wide set-piece about how genomic data is so terribly huge (and growing so incredibly fast!) that it’s going to overwhelm everything.

We’ve been living in the shadow of a tidal wave of data for more than 10 years. Honestly, it’s a little awkward that we’re still sounding the alarm.

The first time the phrase “data tsunami” appeared in my slides was in a presentation from 2007. That was when the first wave of so-called “next-gen” DNA sequencing instruments were really coming into their own. Those instruments increased the velocity of DNA sequencing by around three orders of magnitude. They also reduced the per-base costs of sequencing by an independent three orders of magnitude. Taken together, we experienced about a millionfold increase in the rate of data production.

We observed at the time that this rate increase was in excess of Moore’s Law. Now, as genomic diagnostics and precision / personalized medicines finally make their way into the clinic, we’re making the same observation today. While it’s flattering to hear brag words like “genomical,” it’s also a bit misleading.

Because you know what? We kept up before, and we’ll keep up now. I think that we’re actually better prepared for this decade’s data deluge than we were for the last one.

Sure, there was blood, sweat, and tears – that’s the job of engineering. We changed and adapted untenable practices – including choosing to discard the raw output images from the high resolution cameras on the new sequencers. Instead we stored only the information that was actually useful to the scientists – at the time it was base pairs and quality scores from all the reads. That idea was a fight at the beginning. I recall hours of conversation with scientists incredulous that I would suggest that any data could ever be deleted. Today, you can’t even get the raw images off of the sequencers.

We upgraded the infrastructure of biology facilities for the genomic age. We planned and built high performance network connections all the way out to laboratories. We consolidated data-producing instruments into “cores,” provisioned with infrastructure to handle the network and data storage load. We shifted servers and storage out of aging lab buildings and into co-located data centers. We combined independent compute farms into time-shares on integrated high performance computing environments. We worked out cost recovery schemes to make sure that it was sustainable. As public and private clouds have matured, we’ve continued to evolve, and I’m sure that we will continue to do so.

We also upgraded our human relationships. We forged partnerships with the technologists who build data storage, network, and computing systems. Together, we adapted the tools and techniques already in use in media and entertainment, finance, and other industries to be better fits for the challenges of science. We sent computer science students to biology journal clubs, and vice-versa, and eventually recognized “bioinformatics,” and “computational biology,” as important specializations in their own rights.

We have a decade of trust, education, and mutually beneficial work to build on.

So while it is certainly flattering to hear people proclaim that “genomical” is a better adjective than “astronomical” to describe rapid data growth, I’m not convinced that it’s cause for anything other than enthusiasm. A decade ago it was Terabytes of genomic sequence data for research. Now it’s Petabytes, or even Exabytes, of patient records for precision medicine and genomic diagnostics.

We’re gonna be fine, people. Sure, carry an umbrella, but think of it as “rainbow weather.”

A cautionary tale

Earlier this month, an information security firm found a multi-terabyte dataset of personal information on at least 198 million American voters unsecured, in a world-readable S3 bucket. They did the responsible thing and notified the owners, and then wrote a very accessible description of the situation.

It serves as a decent cautionary tale and metaphor for some of the privacy concerns we face in health care, life sciences, and genomic medicine.

This post is about blame.

Could we blame the coder? The specific mistake that led to the data exposure was in their continuous integration and deployment workflow. A code change had the unintended effect of disabling access controls on the bucket. While the person who checked in that code change certainly made a mistake, it was far from the root cause of the failure. We would be remiss (but in good company) to blame the coder.

Could we blame the cloud provider? I say “absolutely not.” While this sort of exposure is more common with public clouds, it would be radically incorrect to put the blame with the hosting company. Amazon provides robust tools and policies to protect their customers from exactly this sort of mistake. In the health care / life sciences space, they offer a locked-down configuration of their services. They require customers to use this configuration for applications involving HIPAA data. These controls can be imposed at a contract level, meaning that business owners – even those who are not cloud-savvy – have every opportunity to protect their data.

The owners of the bucket chose not to employ Amazon’s guard rails – despite knowing that they were amassing an incredible mass of sensitive and private data on nearly every American.

Could we blame the information security firm? While it is not uncommon to blame the person who finds the door unlocked, rather than the one who failed to lock it, I say “no.”

Could we at least blame the whole firm who owned the bucket? The answer is certainly “yes,” as with the coder above – but it would be a mistake to stop there. This should be an extinction-level-event for the organization responsible, with good reason. I think it would be a shame to fail to go all the way to the root cause.

Responsibility rests with the people who created the dataset. This is true no matter whether we’re talking about genomes, medical records, consumer / social media trails, or whatever. Much of the data in that set was from public sources. Still, we all know that the power of data grows geometrically in combination with other data. When you do the work of aggregating, cleaning, and normalizing diverse datasets – it is your responsibility to be aware of the privacy and appropriate usage implications.

This imposes an ethical burden on data scientists. We cannot just blame the cloud provider, the coder, the business leaders, or whoever else. If you make a dataset that has the potential for this scale of privacy violation, you have a responsibility to make sure that it is appropriately handled. Beyond any technical controls, you have a responsibility to be sure that it is appropriately used. This responsibility transfers: If you hire a team to do things like this, you have a responsibility to be sure they do it in an ethical and effective way.

I’m far too jaded to believe that legal culpability will reach much beyond the coder – but it should.

The game of kings

A very smart and well informed colleague recently shared a thought that disturbed me. I’m writing it here mostly to get it out of my head, and also in the hopes that the eminently quotable Admiral Rickover will once again be proved right: “Weaknesses overlooked in oral discussion become painfully obvious on the written page.”

Here’s the observation: Machine learning and Artificial Intelligence have become a game of kings. The field is now the competitive arena for the likes of Microsoft, Google, Amazon, Facebook, and IBM. When companies of this scale compete, they do so with teams of thousands of people and spend (in aggregate) billions of dollars. The people on these teams are not a uniform sampling of their industry; they are the elite – high level professionals with the freedom to be choosy about their jobs.

The claim is that this presents an insurmountable barrier of entry to anyone who is not on one of those teams. Prosaically, when the King’s Hunt is afield, those of us without the resources of a king are well advised to stay out of the way.

In his words: “If you want to have an impact in AI or ML, the only real choice is which of the billionaires you want to work for.” Further, if you want to use these technologies, the only real choice is which billionaire to buy from.

I find this to be depressing, but not necessarily flawed. It would be easy (and potentially even more accurate) to make the same argument about computational infrastructure in the age of public exascale clouds.

There’s also an insulting subtext to the argument: If you are working with or on ML and AI and are not working for or with a billionaire, your work is de-facto pointless. Further, all the most talented people are flocking to join the King’s teams – maybe it’s just that you didn’t make the cut?

Did I mention that this particular colleague works part-time for Google? It reminds me of the joke about Crossfit: “How do you tell that somebody does crossfit? Oh don’t worry, they’ll tell you.”

With all that said, I don’t buy it. I fall back on Margaret Mead’s famous quote: “Never doubt that a small group of thoughtful, committed citizens can change the world; indeed, it’s the only thing that ever has.”

I harbor a deep-seated optimism about people. Everywhere I go, individuals and small teams absolutely sparkle with creativity and intelligence. These people are not the ‘B’ players, sad that they couldn’t make the cut to join the King’s hunting team. For my entire career, brilliant, hardworking innovators and entrepreneurs have been disrupting established power structures and upending entire markets. They don’t do this by fielding a second tier team in the old game – instead they invent a new game and change the world.

So while the point may be valid for established commodities, it is a bridge too far (and quite the leap of ego) to write off the combined innovative energy of the whole rest of the world.

I would welcome conversation on this. It feels important.

The blockchain part of Blockchain

The blockchain data structure (which is a part of, but distinct from the larger Blockchain ecosystem) consists, perhaps unsurprisingly, of an ordered series of “blocks.”

In addition to a payload of data and a few other housekeeping values, each block (except the first one, the “genesis” or “origination” block) contains the hash of the previous block. As described in a previous post, hash values are easy to verify and challenging to fake. A block is valid if it contains the hash of its predecessor. A valid blockchain contains only valid blocks.

A valid blockchain demonstrates an order of events. One cannot create a block without referring to the prior one. If the hashes are correct, we know that the blocks were created in sequential order, and therefore that the data stored on the chain was also written in that order. We know the relative order in which the data was written (we can’t generate a subsequent block without all the prior ones). We also know that the data has not changed since being written (changes to a block will change the hash, and require changes to all future ones).
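Here is a minimal sketch of that structure in Python. The field names (`index`, `payload`, `prev_hash`) and the use of SHA-256 over a JSON encoding are assumptions made for illustration, not the layout of any particular Blockchain system.

```python
import hashlib
import json


def block_hash(block):
    """Hash a block's full contents, including its predecessor's hash."""
    encoded = json.dumps(block, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def append_block(chain, payload):
    """Append a block that refers to the hash of the block before it."""
    # The genesis block has no predecessor, so it stores no hash.
    prev_hash = block_hash(chain[-1]) if chain else None
    chain.append({"index": len(chain), "payload": payload, "prev_hash": prev_hash})
    return chain


def is_valid_chain(chain):
    """Valid means every non-genesis block holds its predecessor's hash."""
    return all(
        chain[i]["prev_hash"] == block_hash(chain[i - 1])
        for i in range(1, len(chain))
    )


chain = []
for payload in ["first entry", "second entry", "third entry"]:
    append_block(chain, payload)

print(is_valid_chain(chain))          # True
chain[1]["payload"] = "edited entry"  # tamper with an earlier block
print(is_valid_chain(chain))          # False
```

Editing the middle block changes its hash, so the block after it no longer points at a matching predecessor. To hide the edit, every later block would have to be rewritten as well.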

Notably, we get no promises at all concerning validity or security. Merely storing information in a blockchain data structure does not make it correct, complete, or private. In fact, since most Blockchain systems are distributed ledgers (the topic of a future post), information on the chain is somewhat radically public. Every node in most Blockchain networks eventually sees every piece of data on the chain.

Bitcoin and some (but not all) Blockchain systems up the ante on what constitutes a valid block by adding a nonce. The nonce is a value that, added to a block, yields a hash with specific and rare properties. This imposes a cost, called “proof-of-work,” on creating blocks. When creating a new block, authors must try (on average) a large number of nonces until they find one that yields a valid hash. The point of this is to make it computationally challenging merely to create a single new valid block at the end of the chain, and prohibitive to go back and corrupt earlier blocks.
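The nonce search can be sketched in a few lines of Python. This is an illustration only: the string encoding, the single round of SHA-256, and the `DIFFICULTY_BITS` value are assumptions chosen so the example runs in well under a second. It is not Bitcoin's actual block format.

```python
import hashlib

DIFFICULTY_BITS = 12  # hypothetical setting, far easier than any real network


def pow_hash(prev_hash, payload, nonce):
    """Hash the block contents together with a candidate nonce."""
    data = f"{prev_hash}|{payload}|{nonce}".encode("utf-8")
    return hashlib.sha256(data).digest()


def meets_difficulty(digest, bits=DIFFICULTY_BITS):
    """True if the first `bits` bits of the digest are all zero."""
    return int.from_bytes(digest, "big") >> (len(digest) * 8 - bits) == 0


def mine(prev_hash, payload, bits=DIFFICULTY_BITS):
    """Try nonces in order until one yields a hash with the rare property."""
    nonce = 0
    while not meets_difficulty(pow_hash(prev_hash, payload, nonce), bits):
        nonce += 1
    return nonce


nonce = mine("0" * 64, "some payload")
print(meets_difficulty(pow_hash("0" * 64, "some payload", nonce)))  # True
```

Finding the nonce takes thousands of hashes on average at this setting, while checking it takes exactly one. Changing the payload or the previous hash almost certainly invalidates the nonce, which is what makes rewriting old blocks so expensive.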

The computational work of “mining” in the Bitcoin system is actually just searching for valid nonces. This is sufficiently different from conventional mining that it bears saying: In the usual use of the word “mining,” we are seeking out and refining a valuable resource. In Blockchain systems that use proof of work, the rare and precious resource at hand is the trustworthiness of the system itself. Value is not removed by the mining operation – it is actually being created.

Proof of work and the nonce

The blockchain technology ecosystem brings together a diverse set of codes and algorithms that have been developed over the past 50-ish years. It includes decades-old cryptographic techniques like hashing and symmetric/asymmetric key encryption, and also includes relatively recent innovations related to distributed consensus.

The Blockchain ecosystem reminds me of the classic radio tag-line: It’s the best of the 80’s, 90’s, and today.

Proof of work is one component of that ecosystem. It is used to prevent denial of service attacks, in which large numbers of messages swamp and degrade a system. The system works by imposing a computational cost on the creation of valid messages. Receivers check whether messages are valid before they pay any attention to the contents.

The proof of work described in the original Blockchain paper is based on a system called Hashcash, which was developed in 1998 to combat spam email. The sender is required to find a value called a nonce that is specific to a particular message, and that demonstrates that they put effort into creating the message. A valid nonce is rare to find by chance, but easy to verify once found.

This property – numbers and relationships that are challenging to find, but trivial to verify – is the basis of most of modern cryptography. Hash functions are one example. A hash function takes arbitrary input and returns a value within a fixed range. In a good cryptographic hash, the result (sometimes simply called the “hash” of the input) is randomly distributed across that range. It is difficult to author an input to get any particular hash value.

The hashcash algorithm is simple: The nonce is combined with the message to be sent, and the combination is hashed. The hash result must be small relative to all possible hash results. Exactly how small is a parameter that can be used to tune the algorithm.
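As a rough sketch of the idea (not the real Hashcash stamp format, which uses its own header layout and hash function), the “small relative to all possible results” rule can be written as a comparison against a numeric threshold. The names `mint`, `verify`, and `zero_bits`, and the choice of SHA-256, are assumptions for illustration.

```python
import hashlib

HASH_BITS = 256  # SHA-256 is used here purely for illustration


def stamp_value(message, nonce):
    """Interpret the hash of message plus nonce as one large integer."""
    digest = hashlib.sha256(f"{message}:{nonce}".encode("utf-8")).digest()
    return int.from_bytes(digest, "big")


def mint(message, zero_bits):
    """Sender side: search for a nonce whose hash falls below the threshold."""
    target = 1 << (HASH_BITS - zero_bits)
    nonce = 0
    while stamp_value(message, nonce) >= target:
        nonce += 1
    return nonce


def verify(message, nonce, zero_bits):
    """Receiver side: one hash and one comparison."""
    return stamp_value(message, nonce) < (1 << (HASH_BITS - zero_bits))


message = "to:alice@example.com|hello"
nonce = mint(message, zero_bits=12)
print(verify(message, nonce, zero_bits=12))  # True
```

The `zero_bits` argument is the tuning parameter: each additional bit halves the threshold and doubles the expected work for the sender, while the receiver's cost stays at a single hash.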

For example, if the hash function returns a 256 bit value, there are 2^256 possible results. If we insist the nonce be a value that makes the first 16 of those bits ‘0’, we are insisting that senders find one of 2^240 values from among 2^256 possible hash results. The probability of this happening by random chance is one in 2^16, or about 1 in 65,000.

On average (assuming that we have picked a good hash function) senders will have to try 2^16 nonces before finding a valid one. If we assume that each hash takes 1 second to calculate on a single CPU, the sender would invest (on average) about 18 CPU hours, slightly under a CPU day, per message.
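The arithmetic is easy to check. This small sketch uses the figures from the example above; the constant names are just labels for illustration.

```python
HASH_BITS = 256         # size of the hash output in the example
ZERO_BITS = 16          # required number of leading zero bits
SECONDS_PER_HASH = 1.0  # the deliberately slow assumption from the text

total_hashes = 2 ** HASH_BITS
valid_hashes = 2 ** (HASH_BITS - ZERO_BITS)

# With success probability p per try, the expected number of tries is 1/p.
expected_tries = total_hashes // valid_hashes
expected_hours = expected_tries * SECONDS_PER_HASH / 3600

print(expected_tries)            # 65536
print(round(expected_hours, 1))  # 18.2
```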

In the email system proposed in 1998 (I would love to use something like this, by the way) senders invest some amount of computation in creating a nonce for each message. Receivers sort or apply thresholds based on the value of the hash. Low numbered hashes represent an investment in the message. Human beings who type or dictate messages to small numbers of recipients won’t even notice the additional compute effort. Mass marketing campaigns will be expensive.

This exact computation is the work of “mining” in the Bitcoin network. The language of “mining” or “finding” bitcoins obscures the fact that we’re actually searching for nonces.

Of course, compute power keeps getting cheaper, so we need to have a flexible system. Fortunately, the tunable parameter of the nonce makes this simple. If compute performance on hash functions were governed by Moore’s law (it’s actually a bit more complex), then we would need to increase the strictness of our nonce by one bit every two years.
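A sketch of that idealized schedule, under the simplifying assumption that hash throughput doubles every two years: each extra required zero bit doubles the expected number of tries, which cancels out one doubling of compute. The function name and the starting value of 16 bits are illustrative, and this is not how any real network adjusts its difficulty.

```python
DOUBLING_PERIOD_YEARS = 2  # the simplified Moore's-law assumption


def zero_bits_needed(base_bits, years_elapsed):
    """Add one required zero bit for each doubling of hash throughput."""
    return base_bits + years_elapsed // DOUBLING_PERIOD_YEARS


for years in (0, 2, 10, 20):
    print(years, zero_bits_needed(16, years))
# 0 16
# 2 17
# 10 21
# 20 26
```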

The Bitcoin network has been tuning its proof of work to produce valid blocks at a remarkably consistent rate of about one every ten minutes since 2010.

P.S.: Thanks to Eleanor of Diamond Age Data Science for this post explaining the difference between probabilities and likelihoods. An earlier version of this post used the words incorrectly.

The unicorn rant

In biotech these days, I hear a lot of talk about “unicorns.” Sometimes they are rare fancy unicorns … purple, or glittery. At Bio IT World, I found myself moderating a conversation that involved herds and farms of these imaginary animals.

Of course, we were talking about finding and retaining top talent. In the staffing world, “unicorn” is the codeword for an impossibly ideal candidate with a rare mix of skills and experiences. My friends in the recruiting and staffing industries spend their days chasing unicorns. It seems really stressful for them.

Here’s the thing: Unicorns don’t exist.

I’m an engineer by training. I spend a lot of time designing and debugging complex systems. As a rule of thumb, if the plan relies on a continuous supply of something that is either vanishingly rare or (worse) nonexistent – it is a bad plan. When brainstorming, we might joke about knowing a reliable supplier of unobtanium. Sometimes we trot out the old cartoon with the guy saying “and then a miracle occurs.” Eventually, however, engineers sigh and set to work on a better plan.

Not so with many hiring managers, senior leaders, board members, and venture firms in biotech. From what I hear, the plan is to fight harder for the unobtanium, to hope for the miracle.

We need a better plan.

Before going further, I want to first reaffirm my commitment to finding and retaining the best people. Of course people make the difference. Of course we should be highly selective. And yes, of course there are massive, critical differences between candidates. It is a false comparison and a strawman argument to suggest that “making do with a third rate workforce, indiscriminately chosen,” is the only alternative to the unicorn quest.

There are three major pieces to building an organization that does not rely on unicorns:

  • Managers must assume the full time job of supporting and developing their teams.
  • Project plans, workflows, and team behaviors must err on the side of granular, achievable work – with mechanisms to self-correct when the plan is wrong.
  • Recruiting must focus on attitude and enthusiasm, not on finding the next hero.

The non-unicorn plan is straightforward to say, but requires diligent effort and consistency: Divide work into achievable pieces (planning, architecture, and project management are real jobs), hire enthusiastic and intelligent people (give recruiting and HR a fighting chance), and give those people the resources they need (management is a real job).

There’s plenty of literature on this, but you won’t find it in the sci-fi fantasy or the young adult section of the bookstore. Instead, do a quick google on “Hero culture.” You may find yourself reading about burnout, mythical man-months, success catastrophes, and flash-in-the-pan companies.

A more subtle pathology of the unicorn fetish is that it encourages the worst sort of bias and monoculture. When the written criteria are unachievable (unicorn!), then the hiring decision is actually subjective. Rejecting candidate after candidate based on “fit,” or poor interview performance is almost always a warning sign that we’re in bias and blind-spot territory.

As an aside, please recall that interviews are among the worst predictors of job performance.

From the candidate perspective, unicorn recruiting is simple: The best opportunities are only available to the people who have already had the best opportunities (the paper qualifications), and who give favorable first impressions to the hiring manager (bias and cronyism). From what I can see of the startup culture in both Boston and San Francisco, this is in fact the situation. In both cities, we have large populations of motivated people actively seeking work while recruiters work themselves to death. Meanwhile hiring managers make sci-fi/fantasy metaphors to support staffing plans that are based on miracles.

We can do better.

Finally, if none of that convinces you, then perhaps consider the traditional mythology about who, exactly, should be sent to capture a unicorn.

Either way, we’re doing it wrong.

The second decade of the cloud

In my talk at Bio-IT World this year, I made some comments about “cloud” technologies that I think bear repeating.

2017 is somewhere in the middle of the second decade of the cloud.

Of course, when I say “cloud,” I mean much more than mere virtualization. You don’t get the 2017-benefits of “going to the cloud” by just hosting your legacy architecture in Amazon’s east-coast-1 availability zone. Nor do you get them by putting your one-server-per-service enterprise on a fancy VMWare / ESX system, no matter how “hyperconverged” it may be. That’s the kind of misuse of the technology that has kept the “on-prem vs cloud” boondoggle alive so far past its expiration date.

Virtualization, of course, is a very good idea. Depending on how you define it, we’re in at least the third decade of OS level virtualization. That’s even more of a solved problem than the cloud.

The benefits of the cloud in 2017 accrue when you adopt cloud-native architectures. This entails substantially more work than porting a system to a hosted platform. It is also absolutely worth it for all but the longest of the long tail of legacy systems.

A bit of history: Amazon Web Services launched as a platform in 2002, and re-launched with EC2 and S3 in 2006. At the time, I worked for BioTeam. Less than a year later, in 2007, we noticed that every single member of the technical team had independently chosen at least one AWS based solution for a customer need. There was no corporate mandate – it was the right way to do the engineering.

At the time, I was responsible for many aspects of BioTeam’s “Inquiry” software product. By early 2008, we had ported our software to AWS and were offering it under license terms that still read pretty well, 9 years on.

While the FAQ above has aged well, that 2008 port of Inquiry looks pretty dusty in the bright lights of 2017. We took a legacy HPC / batch computing architecture and we virtualized it to run on AWS. There is certainly some forward-looking stuff in there: hosts that spin up and down in response to backlogs of work, and also some cleverness around staging data to and from S3. However, it bears little resemblance to the approach that one might take today.

Chris Dagdigian put it well at Bio-IT World: Many cloud-native system architectures do not have a direct “on-prem” analogue. In particular, Lambda and serverless architectures are challenging to explain in terms of the systems that we built in 2006.

As just one small example: On the Inquiry port, we spent a lot of time convincing our old faithful HPC job scheduler, Sun Grid Engine, to be okay with hosts appearing and disappearing all the time. In our hosted, legacy architecture, SGE interpreted many aspects of the cloud as repeated failure. Compare that with even the most basic autoscaling architectures – to say nothing of the wizardry behind tools like Amazon’s Athena. Athena is frankly a bit mind-bending for somebody who made a good living less than 10 years ago making less-usable systems to do less-robust data analysis.

I find it clarifying to think about “cloud” from the perspective of a non technologist. When the CFO, COO, or CSO think of “cloud,” or articulate a “cloud first” strategy, they almost certainly have business or scientific metrics in mind, rather than technical niceties like where the metal happens to live. When executives ask for “cloud,” in my experience they are asking for things like:

  • Remove an entire category of off-mission task from the in-house team.
  • Make technology updates totally seamless and automatic.
  • Vastly simplify licensing and budgeting – budget in terms of headcount, not opaque version numbers and product families.
  • Scale without limit, even in the event of a “success catastrophe.”

Note that merely virtualizing a legacy architecture onto Amazon, Google, or Microsoft (or yes, even one of the at-least-six-way-tie-for-fourth-place other public cloud providers) provides zero of the benefits for which we are sent off to “cloud.”

The good news: These benefits are, in fact, possible for scientific and high performance computing. It will not be as easy as it was with human resources or office productivity tools, but we will do it. And it will not be as simple as moving everything to AWS east-coast-1.

Blockchain

Over the summer, I have the opportunity to think deeply about the ecosystem of technologies that go by the name “Blockchain.” I’m focusing particularly on how these might apply in a couple of different scientific and healthcare contexts. I plan to post snippets here from time to time, as much to force me to clarify my thinking as anything else. As Hyman Rickover said, "Nothing so sharpens the thought process as writing down one's arguments. Weaknesses overlooked in oral discussion become painfully obvious on the written page."

One challenge when trying to talk about blockchain is that it is massively hyped – sitting right at the peak of Gartner’s hype cycle. Most of the meetings I go to these days include at least one person who asks, regardless of the topic at hand, “what about blockchain?”

Another challenge is that Blockchain is strongly associated with Bitcoin and other cryptocurrencies. The hype brings a certain breathiness to the conversation, while the finance connection brings associations with fraud and nefarious dealings. Neither of these is entirely merited – but I’m finding it important to keep in mind as I explore.

For all that, the foundational documents are remarkably crisp, lucid, and readable. The original bitcoin paper, written under the pseudonym “Satoshi Nakamoto,” is only eight pages long – plus a half page of references.

It’s clear to me that there’s important work to be done in this space – and I’m thrilled to have the time to take part.