Proof of work and the nonce

The blockchain technology ecosystem brings together a diverse set of codes and algorithms that have been developed over the past 50-ish years. It includes decades-old cryptographic techniques like hashing and symmetric/asymmetric key encryption, as well as relatively recent innovations in distributed consensus.

The Blockchain ecosystem reminds me of the classic radio tag-line: It’s the best of the 80’s, 90’s, and today.

Proof of work is one component of that ecosystem. It is used to prevent denial of service attacks, in which large numbers of messages swamp and degrade a system. The system works by imposing a computational cost on the creation of valid messages. Receivers check whether messages are valid before they pay any attention to the contents.

The proof of work described in the original Bitcoin paper is based on a system called Hashcash, which was developed in 1998 to combat spam email. The sender is required to find a value, called a nonce, that is specific to a particular message and that demonstrates they put effort into creating it. A valid nonce is rare to find by chance, but easy to verify once found.

This property – numbers and relationships that are challenging to find, but trivial to verify – is the basis of most of modern cryptography. Hash functions are one example. A hash function takes arbitrary input and returns a value within a fixed range. In a good cryptographic hash, the result (sometimes simply called the “hash” of the input) is, for all practical purposes, uniformly distributed across that range, and it is infeasible to craft an input that produces any particular hash value.
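As a concrete illustration, here is a minimal Python sketch using the standard hashlib library (the messages are made up for the example): two nearly identical inputs produce fixed-size digests with no visible relationship.

```python
import hashlib

# SHA-256 maps arbitrary input to a 256-bit value (64 hex characters).
for message in (b"pay alice 5 coins", b"pay alice 6 coins"):
    digest = hashlib.sha256(message).hexdigest()
    print(message, digest)

# The two digests look unrelated, and there is no known shortcut for
# working backwards from a chosen digest to an input that produces it.
```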

The hashcash algorithm is simple: The nonce is combined with the message to be sent, and the combination is hashed. The hash result must be small relative to all possible hash results. Exactly how small is a parameter that can be used to tune the algorithm.
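Here is a minimal sketch of that loop in Python. SHA-256, the message text, and the 16-bit difficulty are assumptions chosen for illustration, not the exact parameters of Hashcash or Bitcoin:

```python
import hashlib
from itertools import count

def find_nonce(message: bytes, difficulty_bits: int = 16) -> int:
    """Search for a nonce whose combined hash with the message begins with
    `difficulty_bits` zero bits."""
    target = 1 << (256 - difficulty_bits)  # valid hashes fall below this threshold
    for nonce in count():
        digest = hashlib.sha256(message + str(nonce).encode()).digest()
        if int.from_bytes(digest, "big") < target:
            return nonce

def verify(message: bytes, nonce: int, difficulty_bits: int = 16) -> bool:
    """Verification is a single hash, no matter how long the search took."""
    digest = hashlib.sha256(message + str(nonce).encode()).digest()
    return int.from_bytes(digest, "big") < (1 << (256 - difficulty_bits))

nonce = find_nonce(b"hello, world")
print(nonce, verify(b"hello, world", nonce))  # the search is slow; the check is instant
```

Note that the receiver never repeats the search: it recomputes a single hash and compares it against the same threshold.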

For example, if the hash function returns a 256-bit value, there are 2^256 possible results. If we insist that the nonce make the first 16 of those bits ‘0’, we are insisting that senders find one of 2^240 values from among 2^256 possible hash results. The probability of this happening by random chance is one in 2^16, or roughly 1 in 65,000.

On average (assuming that we have picked a good hash function) senders will have to try 2^16 nonces before finding a valid one. If we assume that each hash takes 1 second to calculate on a single CPU, the sender would invest (on average) slightly under a CPU day per message.
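A quick back-of-the-envelope check of those numbers, using the 16-bit difficulty and the assumed one-second-per-hash rate from above:

```python
difficulty_bits = 16
expected_tries = 2 ** difficulty_bits        # about 65,536 hashes on average
seconds_per_hash = 1.0                       # assumed cost per hash
cpu_hours = expected_tries * seconds_per_hash / 3600
print(expected_tries, round(cpu_hours, 1))   # 65536 tries, roughly 18.2 CPU-hours
```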

In the email system proposed in 1998 (I would love to use something like this, by the way), senders invest some amount of computation in creating a nonce for each message. Receivers sort or apply thresholds based on the value of the hash. Low-numbered hashes represent an investment in the message. Human beings who type or dictate messages to small numbers of recipients won’t even notice the additional compute effort. Mass marketing campaigns, on the other hand, will be expensive.

This exact computation is the work of “mining” in the Bitcoin network. The language of “mining” or “finding” bitcoins obscures the fact that we’re actually searching for nonces.

Of course, compute power keeps getting cheaper, so we need a flexible system. Fortunately, the tunable difficulty parameter makes this simple. If compute performance on hash functions were governed by Moore’s law (it’s actually a bit more complex), then we would need to tighten the nonce requirement by one bit every two years.
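A rough sketch of that scaling, assuming (purely for illustration) that hashing throughput doubles every two years:

```python
# Each additional leading-zero bit doubles the expected number of hashes,
# so adding one bit every two years offsets a Moore's-law-style doubling
# of hash throughput and keeps the sender's wall-clock cost roughly flat.
base_bits, base_rate = 16, 1.0               # hashes per second today (assumed)
for years in (0, 2, 4, 6, 8):
    bits = base_bits + years // 2            # tighten by one bit every two years
    rate = base_rate * 2 ** (years / 2)      # hardware gets faster
    expected_seconds = 2 ** bits / rate
    print(years, bits, expected_seconds)     # the time cost stays constant
```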

Since 2010, the Bitcoin network has been tuning its proof of work to produce valid blocks at a remarkably consistent rate of about one every ten minutes.

P.S.: Thanks to Eleanor of Diamond Age Data Science for this post explaining the difference between probabilities and likelihoods. An earlier version of this post used the words incorrectly.

The unicorn rant

In biotech these days, I hear a lot of talk about “unicorns.” Sometimes they are rare fancy unicorns … purple, or glittery. At Bio IT World, I found myself moderating a conversation that involved herds and farms of these imaginary animals.

Of course, we were talking about finding and retaining top talent. In the staffing world, “unicorn” is the codeword for an impossibly ideal candidate with a rare mix of skills and experiences. My friends in the recruiting and staffing industries spend their days chasing unicorns. It seems really stressful for them.

Here’s the thing: Unicorns don’t exist.

I’m an engineer by training. I spend a lot of time designing and debugging complex systems. As a rule of thumb, if the plan relies on a continuous supply of something that is either vanishingly rare or (worse) nonexistent – it is a bad plan. When brainstorming, we might joke about knowing a reliable supplier of unobtanium. Sometimes we trot out the old cartoon with the guy saying “and then a miracle occurs.” Eventually, however, engineers sigh and set to work on a better plan.

Not so with many hiring managers, senior leaders, board members, and venture firms in biotech. From what I hear, the plan is to fight harder for the unobtanium, to hope for the miracle.

We need a better plan.

Before going further, I want to first reaffirm my commitment to finding and retaining the best people. Of course people make the difference. Of course we should be highly selective. And yes, of course there are massive, critical differences between candidates. It is a false comparison and a strawman argument to suggest that “making do with a third-rate workforce, indiscriminately chosen” is the only alternative to the unicorn quest.

There are three major pieces to building an organization that does not rely on unicorns:

  • Managers must assume the full time job of supporting and developing their teams.
  • Project plans, workflows, and team behaviors must err on the side of granular, achievable work – with mechanisms to self-correct when the plan is wrong.
  • Recruiting must focus on attitude and enthusiasm, not on finding the next hero.

The non-unicorn plan is straightforward to say, but requires diligent effort and consistency: Divide work into achievable pieces (planning, architecture, and project management are real jobs), hire enthusiastic and intelligent people (give recruiting and HR a fighting chance), and give those people the resources they need (management is a real job).

There’s plenty of literature on this, but you won’t find it in the sci-fi fantasy or the young adult section of the bookstore. Instead, do a quick google on “Hero culture.” You may find yourself reading about burnout, mythical man-months, success catastrophes, and flash-in-the-pan companies.

A more subtle pathology of the unicorn fetish is that it encourages the worst sort of bias and monoculture. When the written criteria are unachievable (unicorn!), the hiring decision is actually subjective. Rejecting candidate after candidate based on “fit” or poor interview performance is almost always a warning sign that we’re in bias and blind-spot territory.

As an aside, please recall that interviews are among the worst predictors of job performance.

From the candidate perspective, unicorn recruiting is simple: The best opportunities are only available to the people who have already had the best opportunities (the paper qualifications), and who give favorable first impressions to the hiring manager (bias and cronyism). From what I can see of the startup culture in both Boston and San Francisco, this is in fact the situation. In both cities, we have large populations of motivated people actively seeking work while recruiters work themselves to death. Meanwhile, hiring managers reach for sci-fi/fantasy metaphors to support staffing plans that are based on miracles.

We can do better.

Finally, if none of that convinces you, then perhaps consider the traditional mythology about who, exactly, should be sent to capture a unicorn.

Either way, we’re doing it wrong.

The second decade of the cloud

In my talk at Bio-IT World this year, I made some comments about “cloud” technologies that I think bear repeating.

2017 is somewhere in the middle of the second decade of the cloud.

Of course, when I say “cloud,” I mean much more than mere virtualization. You don’t get the 2017 benefits of “going to the cloud” by just hosting your legacy architecture in Amazon’s us-east-1 region. Nor do you get them by putting your one-server-per-service enterprise on a fancy VMware / ESX system, no matter how “hyperconverged” it may be. That’s the kind of misuse of the technology that has kept the “on-prem vs. cloud” boondoggle alive so far past its expiration date.

Virtualization, of course, is a very good idea. Depending on how you define it, we’re in at least the third decade of OS level virtualization. That’s even more of a solved problem than the cloud.

The benefits of the cloud in 2017 accrue when you adopt cloud-native architectures. This entails substantially more work than porting a system to a hosted platform. It is also absolutely worth it for all but the longest of the long tail of legacy systems.

A bit of history: Amazon Web Services launched as a platform in 2002, and re-launched with EC2 and S3 in 2006. At the time, I worked for BioTeam. Less than a year later, in 2007, we noticed that every single member of the technical team had independently chosen at least one AWS-based solution for a customer need. There was no corporate mandate – it was the right way to do the engineering.

At the time, I was responsible for many aspects of BioTeam’s “Inquiry” software product. By early 2008, we had ported our software to AWS and were offering it under license terms that still read pretty well, 9 years on.

While the FAQ above has aged well, that 2008 port of Inquiry looks pretty dusty in the bright lights of 2017. We took a legacy HPC / batch computing architecture and virtualized it to run on AWS. There is certainly some forward-looking stuff in there: hosts that spin up and down in response to backlogs of work, and some cleverness around staging data to and from S3. However, it bears little resemblance to the approach that one might take today.

Chris Dagdigian put it well at Bio-IT World: Many cloud-native system architectures do not have a direct “on-prem” analogue. In particular, Lambda and serverless architectures are challenging to explain in terms of the systems that we built in 2006.

As just one small example: On the Inquiry port, we spent a lot of time convincing our old, faithful HPC job scheduler, Sun Grid Engine, to be okay with hosts appearing and disappearing all the time. In our hosted, legacy architecture, SGE interpreted many aspects of the cloud as repeated failure. Compare that with even the most basic autoscaling architectures – to say nothing of the wizardry behind tools like Amazon’s Athena. Athena is frankly a bit mind-bending for somebody who, less than ten years ago, made a good living building less-usable systems to do less-robust data analysis.
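For a sense of what “basic autoscaling” means in practice, here is a hedged sketch using boto3 against an AWS Auto Scaling group. The group name and target value are hypothetical, and it assumes the group and its launch template already exist; the point is that the platform adds and removes hosts as load changes, which is exactly the behavior we once had to trick SGE into tolerating.

```python
import boto3

autoscaling = boto3.client("autoscaling")

# Hypothetical group name; the group and its launch template are assumed to exist.
# A single target-tracking policy tells AWS to add or remove instances so that
# average CPU utilization stays near 60%, with no scheduler gymnastics required.
autoscaling.put_scaling_policy(
    AutoScalingGroupName="inquiry-workers",
    PolicyName="keep-cpu-near-60",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ASGAverageCPUUtilization"
        },
        "TargetValue": 60.0,
    },
)
```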

I find it clarifying to think about “cloud” from the perspective of a non-technologist. When the CFO, COO, or CSO thinks of “cloud,” or articulates a “cloud first” strategy, they almost certainly have business or scientific metrics in mind, rather than technical niceties like where the metal happens to live. When executives ask for “cloud,” in my experience they are asking for things like:

  • Remove an entire category of off-mission task from the in-house team.
  • Make technology updates totally seamless and automatic.
  • Vastly simplify licensing and budgeting – budget in terms of headcount, not opaque version numbers and product families.
  • Scale without limit, even in the event of a “success catastrophe.”

Note that merely virtualizing a legacy architecture onto Amazon, Google, or Microsoft (or yes, even one of the at-least-six-way-tie-for-fourth-place other public cloud providers) provides zero of the benefits for which we are sent off to “cloud.”

The good news: These benefits are, in fact, possible for scientific and high performance computing. It will not be as easy as it was with human resources or office productivity tools, but we will do it. And it will not be as simple as moving everything to AWS us-east-1.

Blockchain

This summer, I have the opportunity to think deeply about the ecosystem of technologies that go by the name “Blockchain.” I’m focusing particularly on how these might apply in a couple of different scientific and healthcare contexts. I plan to post snippets here from time to time, as much to force myself to clarify my thinking as anything else. As Hyman Rickover said, "Nothing so sharpens the thought process as writing down one's arguments. Weaknesses overlooked in oral discussion become painfully obvious on the written page."

One challenge when trying to talk about blockchain is that it is massively hyped – sitting right at the peak of Gartner’s hype cycle. Most of the meetings I go to these days include at least one person who asks, regardless of the topic at hand, “what about blockchain?”

Another challenge is that Blockchain is strongly associated with Bitcoin and other cryptocurrencies. The hype brings a certain breathiness to the conversation, while the finance connection brings associations with fraud and nefarious dealings. Neither of these is entirely merited – but I’m finding it important to keep in mind as I explore.

For all that, the foundational documents are remarkably crisp, lucid, and readable. The original bitcoin paper, written under the pseudonym “Satoshi Nakamoto,” is only eight pages long – plus a half page of references.

It’s clear to me that there’s important work to be done in this space – and I’m thrilled to have the time to take part.

Bio-IT World 2017

Last week, one of my favorite conferences celebrated its 15th year: Bio-IT World. This is one of only a few forums where my particular professional community comes together to share experiences, re-connect, and try to catch a glimpse of what the future might hold. I’ve been showing up since at least 2004, and it’s always great to see old friends and to meet new ones.

I’m on the committee that awards “Best Practices” and “Best in Show.” This year I didn’t actually get to walk around and see the teams present their work – I missed that private tour of the most interesting products and exhibitors. Even so, we were amazed at the strength of the entries this year. We saw projects from clinical diagnostics, agricultural genomics, semantic reasoning, and so very much more. The choices are never straightforward, but this year turned some sort of a corner for me in terms of direct impact and technical maturity.

The most fun two hours of the show was moderating the re-imagined “Trends from the Trenches.” Chris Dagdigian has been giving that talk since 2008, and he finally won his multi-year battle to expand it beyond just him. We re-framed the session as a panel / mini-symposium with five BioTeam people. It was a thrill to try to keep up with that crew, live and in real time. We also tried a couple of experiments in audience interaction: Slido to submit and vote on topics, and a pair of Catchbox throwable microphones. There was immediate synergy between the two systems: The very first Slido question was “can you reach the back of the auditorium with the Catchbox?” I couldn’t, but got an audience assist and it worked out okay.

I also got to give a solo talk in the Data track, focused on a couple of practical points about cloud technologies and data management. I’ve put those slides online at Slideshare.

Thanks to everybody at CHI who worked so hard to put together a great show. See you next year!