Identity

Another day, another data breach.

The Swedish government has apparently exposed personal identifying data on nearly all of their citizens. The dataset came from the ministry of transportation. It included names, photographs, home addresses, birthdates, and other details about citizens – as well as maintenance data on both roads and military and government vehicles. Perhaps most squirm-inducing, the dataset included active duty members of the special forces, fighter pilots, and people living under aliases as part of a witness protection program.

The data has been exposed since at least 2015. We’re just finding out about it now.

I have written in the past about the perils of compiling this sort of dataset. This particular ministry has a good excuse: They print identification cards. The fact that they emailed the information around in clear-text and handed management and storage off to third party processors with little or no diligence? That’s another story.

It provides a decent opportunity to talk about identity and zero knowledge proofs.

Identity is one of those concepts that appears simple from a distance, but that always seems to wriggle out of any rigorous definition.

For today, let’s say that identity is a set of properties associated with a person. We use these properties (or knowledge of them) to verify that someone is who they say they are. We can deal with group identities and pseudonyms in another post. Let’s also agree to defer metaphysics and philosophy around any deeper meaning of the word “identity,” at least for the moment.

My name, birthdate, address, social security number, fingerprints, bank account numbers, current and past addresses, first pet, high school, mother’s maiden name, and so on are all properties attached to and supporting “my” identity. This list includes examples commonly used by banks and websites. When someone calls my bank on the phone and claims to be me, the bank might ask for any or all of the above. As the answers provided by the caller match the ones in the bank’s database, the bank gains confidence that the caller is actually me.

Once a birthday, address, or other similar fact is widely known, it becomes substantially less useful in demonstrating identity. It also becomes substantially easier for people to fake an identity.

This data breach brings a particular problem into stark relief: Our identity cards have all sorts of identifying information printed on them, and that information is available to anybody holding the card (or the database from which it came).

The bartender doesn’t need to know my birthday – they need to know that I am of legal age to buy alcohol. They certainly don’t need to know my address or organ donor status.

This is where zero knowledge proofs come in. A zero knowledge proof is an answer to a question (“is this person of legal drinking age?”) that does not expose any unnecessary information (like date of birth or address) beyond that answer.

In practice, the simplest way to get this property is to rely on a trusted third party who holds the private data and provides the answers. Instead of printing dates of birth on ID cards, we might print a simple barcode. The bartender would scan the barcode with a phone or other mobile app, and receive a “yes” or “no” answer immediately from the appropriate agency. In some cases, the third party might send me a message letting me know that somebody scanned my ID card. In other cases (like financial transactions), they might even wait for me to validate the request before sending the approval.
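
Here is a minimal sketch of that idea in Python. The registry, the card token, and the function name are all invented for illustration; a real service would add authentication, signed responses, and the audit trail discussed below.

```python
# A toy attestation service: the trusted third party holds the private data
# and answers only the question asked. Everything here is hypothetical.
from datetime import date

# The registry the third party keeps privately (illustrative data).
REGISTRY = {
    "card-token-123": {"birthdate": date(1990, 6, 1), "address": "..."},
}

def is_of_legal_age(card_token: str, legal_age: int = 21) -> bool:
    """Answer the bartender's question without revealing the birthdate."""
    record = REGISTRY.get(card_token)
    if record is None:
        return False  # unknown, revoked, or reported-lost card
    today = date.today()
    born = record["birthdate"]
    age = today.year - born.year - ((today.month, today.day) < (born.month, born.day))
    return age >= legal_age

# The scanner app sees only True or False, never the underlying record.
print(is_of_legal_age("card-token-123"))
```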

If the third party is trustworthy, having them in the loop can radically increase our information security – both by reducing information leakage and by providing a trail of requests for information. Imagine a drivers license that did not contain your private information, and could be invalidated as soon as you reported it lost.

Blockchain technologies seem likely to provide a robust solution to the question of a trusted third party in a trust-free environment. More on that in a later post.

The oldest part of Blockchain

Public key encryption, or PKE, is one of the oldest techniques in the blockchain toolbox. PKE dates from the 1970s and has a lineage of being “discovered” by both military and civilian researchers. It’s powerful stuff: One of the early implementations of a PKE system, called “RSA,” was famously classified as a munition and subject to export control by the United States government.

While PKE (also called “asymmetric key” encryption) is a critical technology in Blockchain systems, I care about it mostly because I get a lot of email. With PKE it is conceptually straightforward to encrypt and “sign” a message in such a way that the identity of the sender is publicly verifiable and that the intended receiver is the only one who can open it. I’ll explain why that matters for my INBOX further on in this post.

Most of the algorithms that underpin PKE make use of pairs of numbers – called “keys” – that are related in a particular way. These “key pairs” are used as input to algorithms to encrypt and decrypt messages. A message that has been encrypted with one of the keys in a pair can only be decrypted using the matching key. As with cryptographic hashes, these systems rely on the fact that while it is straightforward to create a pair of keys, it is computationally impractical to guess the second key in a pair given only the first.

This is conceptually distinct from “symmetric” key algorithms, which use the same key for both encryption and decryption.

In one common use of PKE, one half of a key pair is designated as “public,” while the other is “private.” We share the public key widely, posting it on websites and key registries. The private key is closely held. If someone wants to send me a message, they encrypt it using my public key. Since I’m the only one with the private partner to that public key, I’m the only one who can decrypt the message.

Similarly, if the sender wants to “sign” their message, they can encrypt a message using their private key. In this case, only people with access to the public key will be able to decrypt it. This is, of course, not very limiting. Anybody in the world has access to the public key. However, it is still useful, because we know that this particular message was encrypted using the private partner to that public key.

What is particularly cool is that we can “stack” these operations, building them one on top of the other. A very common approach is to encrypt a message twice, first using the sender’s private key to provide verification of their identity, and then a second time using the recipient’s public key, to ensure that only the recipient can open the message.
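
To make this concrete, here is a small sketch using Python’s cryptography package (my choice for illustration; any mainstream PKE library exposes the same ideas). It signs a short message with the sender’s private key and encrypts it with the recipient’s public key. Real systems typically sign a hash of the message and use hybrid encryption for the payload rather than literally nesting two RSA operations, but the trust relationships are exactly the ones described above.

```python
from cryptography.hazmat.primitives.asymmetric import rsa, padding
from cryptography.hazmat.primitives import hashes

# Two key pairs: one for the sender, one for the recipient.
sender_private = rsa.generate_private_key(public_exponent=65537, key_size=2048)
sender_public = sender_private.public_key()
recipient_private = rsa.generate_private_key(public_exponent=65537, key_size=2048)
recipient_public = recipient_private.public_key()

message = b"Your lab results are ready."

# "Sign" with the sender's private key: anyone with the public key can verify.
signature = sender_private.sign(
    message,
    padding.PSS(mgf=padding.MGF1(hashes.SHA256()), salt_length=padding.PSS.MAX_LENGTH),
    hashes.SHA256(),
)

# Encrypt with the recipient's public key: only their private key can open it.
ciphertext = recipient_public.encrypt(
    message,
    padding.OAEP(mgf=padding.MGF1(algorithm=hashes.SHA256()),
                 algorithm=hashes.SHA256(), label=None),
)

# The recipient decrypts with their private key...
plaintext = recipient_private.decrypt(
    ciphertext,
    padding.OAEP(mgf=padding.MGF1(algorithm=hashes.SHA256()),
                 algorithm=hashes.SHA256(), label=None),
)

# ...and verifies the signature with the sender's public key.
# verify() raises InvalidSignature if the message was tampered with.
sender_public.verify(
    signature,
    plaintext,
    padding.PSS(mgf=padding.MGF1(hashes.SHA256()), salt_length=padding.PSS.MAX_LENGTH),
    hashes.SHA256(),
)
print(plaintext)
```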

Many Blockchain systems use this scheme to verify that the person (or people, or computer program) authorizing a transaction is in fact allowed to do so. Because key pairs are cheap and plentiful, Bitcoin wallets typically create a fresh pair of keys for every single transaction.

Back to my surplus of email: None of my banks or healthcare providers have deployed this nearly 40-year-old capability for communicating with me. Instead, a growing fraction of my inbound email consists of notifications that I have a message waiting on some “secure message center.” I am exhorted to click a link and sometimes required to enter my password in order to see the message.

This practice is actively harmful. Fraudulent links in emails are among the primary vectors by which computers are infected with malware. When we teach the absolute basics of information security, “don’t click the link,” comes right after “don’t share your password,” but before “we will never ask for your password.”

Email systems that use PKE have been around since I’ve been using technology, and somehow my bank and my hospital haven’t caught on. The HIPAA requirement to use “secure messaging” has driven them backwards, not forwards.

Perhaps if we call it “Blockchain messaging,” it’ll finally catch on.

The answer is “Hybrid”

“Hybrid” is the answer.

I’m talking about “on-prem vs cloud,” the bugaboo trick question that has plagued us for nearly a decade.

What I mean by reframing the question like this is that – absent other details – the location and ownership of servers matter much less than the architecture of the solutions.

So let’s get on with it. This is more than just philosophy. We are engineers – we thrive on specifics:

A colleague of mine likes to say, “ssh is cheating.” What he means is that it is unacceptable, in 2017, to leave your system so incomplete that it requires a human to log in and perform manual configuration in the event of an update or (heavens forfend) a reboot. SSH as a protocol is fine (most of the tools below run over it); it’s the manual intervention that constitutes cheating.

System configurations should be defined in software. It doesn’t really matter which domain specific language we pick, any of Puppet, Chef, or Ansible will work fine. It is important to pick one and get good at it, rather than trying to maintain a polyglot mashup.
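
For anyone who hasn’t used these tools, the core idea is declared, idempotent state: describe what the machine should look like, and let the tooling converge on it. Here is a toy Python sketch of that idea (the package name and file path are illustrative, and this is in no way a substitute for Puppet, Chef, or Ansible):

```python
# A minimal sketch of "declare desired state" configuration. Running it twice
# produces the same result as running it once -- which is what lets a machine
# rebuild itself without a human logging in to "fix" things by hand.
import subprocess

def ensure_package(name: str) -> None:
    """Install a package only if it is missing (Debian-style, illustrative)."""
    installed = subprocess.run(["dpkg", "-s", name],
                               capture_output=True).returncode == 0
    if not installed:
        subprocess.run(["apt-get", "install", "-y", name], check=True)

def ensure_line_in_file(path: str, line: str) -> None:
    """Make sure a configuration line is present, without duplicating it."""
    with open(path, "a+") as handle:
        handle.seek(0)
        if line not in handle.read().splitlines():
            handle.write(line + "\n")

ensure_package("ntp")
ensure_line_in_file("/etc/ntp.conf", "server time.example.org iburst")
```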

Configuration scripts need to be under version control, just like any other software. GitHub is the default code repository these days, though there are reasons to go with something slightly more pre-integrated like Bitbucket.

Configuration changes should follow a structured process like Gitflow. It is worth noting that this tool is only helpful if coupled with human processes of communication and trust. Human beings need to check and validate each other’s work, to avoid overwriting or colliding with each other’s changes.

Once a change is checked in and reviewed, continuous integration, test, and deployment is the order of the day. Tools like Jenkins remove all of the manual interventions between approving a change and seeing the code built, tested, and pushed to production on whatever schedule the team has picked. Note that this is not an argument for the wild west. For most teams, most of the time, I’m an advocate of “read-only Fridays,” since changes on a Friday frequently lead to long weekends at the keyboard.

All of this is just as true in the on-premises data center as it is in Amazon’s us-east-2 region. You don’t get to ignore modern systems engineering practices just because finance negotiated a really killer deal with Dell.

So when someone asks me, “on-prem vs cloud,” without further elaboration, I say “hybrid.” It’s an answer that allows me to get on with building robust, scalable, agile systems no matter who happens to win out as the infrastructure provider.

The ever gathering storm

It’s summertime – season of thunderstorms. Most days are punctuated with ominous clouds and distant thunder. Actual rain, however, is rare. The forecast is consistent – temperatures may spike up to uncomfortably hot in the afternoon, and there are low odds of a thunderstorm. I carry an umbrella all day, and then water my garden by hand.

It reminds me of our industry-wide set-piece about how genomic data is so terribly huge (and growing so incredibly fast!) that it’s going to overwhelm everything.

We’ve been living in the shadow of a tidal wave of data for more than 10 years. Honestly, it’s a little awkward that we’re still sounding the alarm.

The first time the phrase “data tsunami” appeared in my slides was in a presentation from 2007. That was when the first wave of so-called “next-gen” DNA sequencing instruments was really coming into its own. Those instruments increased the velocity of DNA sequencing by around three orders of magnitude. They also reduced the per-base costs of sequencing by an independent three orders of magnitude. Taken together (faster instruments, plus far more sequencing for the same money), we experienced roughly a millionfold increase in the rate of data production.

We observed at the time that this rate increase was in excess of Moore’s Law. Now, as genomic diagnostics and precision / personalized medicine finally make their way into the clinic, we’re making the same observation again. While it’s flattering to hear brag words like “genomical,” it’s also a bit misleading.

Because you know what? We kept up before, and we’ll keep up now. I think that we’re actually better prepared for this decade’s data deluge than we were for the last one.

Sure, there was blood, sweat, and tears – that’s the job of engineering. We changed and adapted untenable practices – including choosing to discard the raw output images from the high resolution cameras on the new sequencers. Instead we stored only the information that was actually useful to the scientists – at the time it was base calls and quality scores from all the reads. That idea was a fight at the beginning. I recall hours of conversation with scientists incredulous that I would suggest that any data could ever be deleted. Today, you can’t even get the raw images off of the sequencers.

We upgraded the infrastructure of biology facilities for the genomic age. We planned and built high performance network connections all the way out to laboratories. We consolidated data-producing instruments into “cores,” provisioned with infrastructure to handle the network and data storage load. We shifted servers and storage out of aging lab buildings and into co-located data centers. We combined independent compute farms into time-shares on integrated high performance computing environments. We worked out cost recovery schemes to make sure that it was sustainable. As public and private clouds have matured, we’ve continued to evolve, and I’m sure that we will continue to do so.

We also upgraded our human relationships. We forged partnerships with the technologists who build data storage, network, and computing systems. Together, we adapted the tools and techniques already in use in media and entertainment, finance, and other industries to be better fits for the challenges of science. We sent computer science students to biology journal clubs, and vice-versa, and eventually recognized “bioinformatics,” and “computational biology,” as important specializations in their own rights.

We have a decade of trust, education, and mutually beneficial work to build on.

So while it is certainly flattering to hear people proclaim that “genomical” is a better adjective than “astronomical” to describe rapid data growth, I’m not convinced that it’s cause for anything other than enthusiasm. A decade ago it was Terabytes of genomic sequence data for research. Now it’s Petabytes, or even Exabytes, of patient records for precision medicine and genomic diagnostics.

We’re gonna be fine, people. Sure, carry an umbrella, but think of it as “rainbow weather.”

A cautionary tale

Earlier this month, an information security firm found a multi-terabyte dataset of personal information on at least 198 million American voters unsecured, in a world readable S3 bucket. They did the responsible thing and notified the owners, and then wrote a very accessible description of the situation.

It serves as a decent cautionary tale and metaphor for some of the privacy concerns we face in health care, life sciences, and genomic medicine.

This post is about blame.

Could we blame the coder? The specific mistake that led to the data exposure was in their continuous integration and deployment workflow. A code change had the unintended effect of disabling access controls on the bucket. While the person who checked in that code change certainly made a mistake, it was far from the root cause of the failure. We would be remiss (but in good company) to blame the coder.
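
For what it’s worth, this failure mode (a pipeline change that quietly opens a bucket to the world) is exactly the kind of thing an automated check can catch before deployment. Here is a minimal sketch of such a check, assuming boto3 with credentials already configured; the bucket handling is illustrative, not anyone’s actual tooling:

```python
# Flag any S3 bucket whose ACL grants access to everyone. A check like this
# can run in the CI/CD pipeline so that a bad configuration change fails the
# build instead of silently shipping. Illustrative sketch only.
import boto3

PUBLIC_GRANTEES = {
    "http://acs.amazonaws.com/groups/global/AllUsers",
    "http://acs.amazonaws.com/groups/global/AuthenticatedUsers",
}

def publicly_readable(s3, bucket_name: str) -> bool:
    """Return True if the bucket's ACL grants access to the world."""
    acl = s3.get_bucket_acl(Bucket=bucket_name)
    return any(
        grant.get("Grantee", {}).get("URI") in PUBLIC_GRANTEES
        for grant in acl["Grants"]
    )

s3 = boto3.client("s3")
for bucket in s3.list_buckets()["Buckets"]:
    if publicly_readable(s3, bucket["Name"]):
        print("WORLD READABLE:", bucket["Name"])
```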

Could we blame the cloud provider? I say “absolutely not.” While this sort of exposure is more common with public clouds, it would be radically incorrect to put the blame with the hosting company. Amazon provides robust tools and policies to protect their customers from exactly this sort of mistake. In the health care / life sciences space, they offer a locked-down configuration of their services. They require customers to use this configuration for applications involving HIPAA data. These controls can be imposed at a contract level, meaning that business owners – even those who are not cloud-savvy – have every opportunity to protect their data.

The owners of the bucket chose not to employ Amazon’s guard rails – despite knowing that they were amassing an incredible mass of sensitive and private data on nearly every American.

Could we blame the information security firm? While it is not uncommon to blame the person who finds the door unlocked, rather than the one who failed to lock it, I say “no.”

Could we at least blame the whole firm that owned the bucket? The answer is certainly “yes,” as with the coder above – but it would be a mistake to stop there. This should be an extinction-level event for the organization responsible, and with good reason. Still, I think it would be a shame to fail to go all the way to the root cause.

Responsibility rests with the people who created the dataset. This is true no matter whether we’re talking about genomes, medical records, consumer / social media trails, or whatever. Much of the data in that set was from public sources. Still, we all know that the power of data grows geometrically in combination with other data. When you do the work of aggregating, cleaning, and normalizing diverse datasets – it is your responsibility to be aware of the privacy and appropriate usage implications.

This imposes an ethical burden on data scientists. We cannot just blame the cloud provider, the coder, the business leaders, or whoever else. If you make a dataset that has the potential for this scale of privacy violation, you have a responsibility to make sure that it is appropriately handled. Beyond any technical controls, you have a responsibility to be sure that it is appropriately used. This responsibility transfers: If you hire a team to do things like this, you have a responsibility to be sure they do it in an ethical and effective way.

I’m far too jaded to believe that legal culpability will reach much beyond the coder – but it should.

The game of kings

A very smart and well informed colleague recently shared a thought that disturbed me. I’m writing it here mostly to get it out of my head, and also in the hopes that the eminently quotable Admiral Rickover will once again be proved right: “Weaknesses overlooked in oral discussion become painfully obvious on the written page.”

Here’s the observation: Machine learning and Artificial Intelligence have become a game of kings. The field is now the competitive arena for the likes of Microsoft, Google, Amazon, Facebook, and IBM. When companies of this scale compete, they do so with teams of thousands of people and spend (in aggregate) billions of dollars. The people on these teams are not a uniform sampling of their industry, they are the elite – high level professionals with the freedom to be choosy about their jobs.

The claim is that this presents an insurmountable barrier to entry for anyone who is not on one of those teams. Prosaically, when the King’s Hunt is afield, those of us without the resources of a king are well advised to stay out of the way.

In his words: “If you want to have an impact in AI or ML, the only real choice is which of the billionaires you want to work for.” Further, if you want to use these technologies, the only real choice is which billionaire to buy from.

I find this to be depressing, but not necessarily flawed. It would be easy (and potentially even more accurate) to make the same argument about computational infrastructure in the age of public exascale clouds.

There’s also an insulting subtext to the argument: If you are working with or on ML and AI and are not working for or with a billionaire, your work is de-facto pointless. Further, all the most talented people are flocking to join the King’s teams – maybe it’s just that you didn’t make the cut?

Did I mention that this particular colleague works part-time for Google? It reminds me of the joke about Crossfit: “How can you tell that somebody does Crossfit? Oh don’t worry, they’ll tell you.”

With all that said, I don’t buy it. I fall back on Margaret Mead’s famous quote: “Never doubt that a small group of thoughtful, committed citizens can change the world; indeed, it’s the only thing that ever has.”

I harbor a deep-seated optimism about people. Everywhere I go, individuals and small teams absolutely sparkle with creativity and intelligence. These people are not the ‘B’ players, sad that they couldn’t make the cut to join the King’s hunting team. For my entire career, brilliant, hardworking innovators and entrepreneurs have been disrupting established power structures and upending entire markets. They don’t do this by fielding a second tier team in the old game – instead they invent a new game and change the world.

So while the point may be valid for established commodities, it is a bridge too far (and quite the leap of ego) to write off the combined innovative energy of the whole rest of the world.

I would welcome conversation on this. It feels important.

The blockchain part of Blockchain

The blockchain data structure (which is a part of, but distinct from the larger Blockchain ecosystem) consists, perhaps unsurprisingly, of an ordered series of “blocks.”

In addition to a payload of data and a few other housekeeping values, each block (except the first one, the “genesis” or “origination” block) contains the hash of the previous block. As described in a previous post, hash values are easy to verify and challenging to fake. A block is valid if it contains the hash of its predecessor. A valid blockchain contains only valid blocks.

A valid blockchain demonstrates an order of events. One cannot create a block without referring to the prior one. If the hashes are correct, we know that the blocks were created in sequential order, and therefore that the data stored on the chain was also written in that order. We know the relative order in which the data was written (we can’t generate a subsequent block without all the prior ones). We also know that the data has not changed since being written (changes to a block will change the hash, and require changes to all future ones).
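
A toy example makes the bookkeeping concrete. This sketch (plain Python, with invented payloads) builds a few hash-linked blocks and then shows that tampering with any block invalidates everything that follows it:

```python
import hashlib
import json

def block_hash(payload, previous_hash):
    """Hash the block contents (payload plus the previous block's hash)."""
    body = json.dumps({"payload": payload, "prev": previous_hash}, sort_keys=True)
    return hashlib.sha256(body.encode()).hexdigest()

def make_block(payload, previous_hash):
    return {"payload": payload, "prev": previous_hash,
            "hash": block_hash(payload, previous_hash)}

def is_valid_chain(chain):
    """Every block must contain the hash of its predecessor, and every
    stored hash must match the block's actual contents."""
    for prev, block in zip(chain, chain[1:]):
        if block["prev"] != prev["hash"]:
            return False
        if block["hash"] != block_hash(block["payload"], block["prev"]):
            return False
    return True

chain = [make_block("genesis", previous_hash=None)]
chain.append(make_block("Alice pays Bob 5", chain[-1]["hash"]))
chain.append(make_block("Bob pays Carol 2", chain[-1]["hash"]))

print(is_valid_chain(chain))                 # True
chain[1]["payload"] = "Alice pays Bob 500"   # rewrite history...
print(is_valid_chain(chain))                 # False: the hashes no longer line up
```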

Notably, we get no promises at all concerning validity or security. Merely storing information in a blockchain data structure does not make it correct, complete, or private. In fact, since most Blockchain systems are distributed ledgers (the topic of a future post), information on the chain is somewhat radically public. Every node in most Blockchain networks eventually sees every piece of data on the chain.

Bitcoin and some (but not all) Blockchain systems up the ante on what constitutes a valid block by adding a nonce. The nonce is a value that, added to a block, yields a hash with specific and rare properties. This imposes a cost, called “proof-of-work,” on creating blocks. When creating a new block, authors must try (on average) a large number of nonces until they find one that yields a valid hash. The point of this is to make it computationally challenging merely to create a single new valid block at the end of the chain, and prohibitive to go back and corrupt earlier blocks.

The computational work of “mining” in the Bitcoin system is actually just searching for valid nonces. This is sufficiently different from conventional mining that it bears saying: In the usual use of the word “mining,” we are seeking out and refining a valuable resource. In Blockchain systems that use proof of work, the rare and precious resource at hand is the trustworthiness of the system itself. Value is not removed by the mining operation – it is actually being created.

Proof of work and the nonce

The blockchain technology ecosystem brings together a diverse set of codes and algorithms that have been developed over the past 50-ish years. It includes decades old cryptographic techniques like hashing and symmetric/asymmetric key encryption, and also includes relatively recent innovations related to distributed consensus.

The Blockchain ecosystem reminds me of the classic radio tag-line: It’s the best of the 80’s, 90’s, and today.

Proof of work is one component of that ecosystem. It is used to prevent denial of service attacks, in which large numbers of messages swamp and degrade a system. The system works by imposing a computational cost on the creation of valid messages. Receivers check whether messages are valid before they pay any attention to the contents.

The proof of work described in the original Bitcoin paper is based on a system called Hashcash, which was developed in 1998 to combat spam email. The sender is required to find a value called a nonce that is specific to a particular message, and that demonstrates that they put effort into creating the message. A valid nonce is hard to find by chance, but easy to verify once found.

This property – numbers and relationships that are challenging to find, but trivial to verify – is the basis of most of modern cryptography. Hash functions are one example. A hash function takes arbitrary input and returns a value within a fixed range. In a good cryptographic hash, the result (sometimes simply called the “hash” of the input) is randomly distributed across that range. It is difficult to author an input to get any particular hash value.

The hashcash algorithm is simple: The nonce is combined with the message to be sent, and the combination is hashed. The hash result must be small relative to all possible hash results. Exactly how small is a parameter that can be used to tune the algorithm.

For example, if the hash function returns a 256-bit value, there are 2^256 possible results. If we insist that the nonce be a value that makes the first 16 of those bits ‘0’, we are insisting that senders find one of 2^240 values from among 2^256 possible hash results. The probability of this happening by random chance is one in 2^16, or something like 1 in 65,000.

On average (assuming that we have picked a good hash function) senders will have to try 2^16 nonces before finding a valid one. If we assume that each hash takes 1 second to calculate on a single CPU, the sender would invest (on average) slightly under a CPU day per message.
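
Here is what that search looks like in code: a hashcash-style sketch in Python, with an invented message and the 16-bit difficulty from the example above. Finding a nonce takes about 2^16 hashes on average; verifying one takes a single hash:

```python
import hashlib
from itertools import count

def leading_zero_bits(digest: bytes) -> int:
    """Count the leading zero bits of a hash digest."""
    bits = 0
    for byte in digest:
        if byte == 0:
            bits += 8
            continue
        bits += 8 - byte.bit_length()
        break
    return bits

def find_nonce(message: bytes, difficulty_bits: int = 16) -> int:
    """Search for a nonce whose hash, combined with the message,
    starts with `difficulty_bits` zero bits (hashcash-style proof of work)."""
    for nonce in count():
        digest = hashlib.sha256(message + str(nonce).encode()).digest()
        if leading_zero_bits(digest) >= difficulty_bits:
            return nonce

msg = b"please read my email"
nonce = find_nonce(msg, difficulty_bits=16)   # ~2**16 = 65,536 tries on average
print(nonce, hashlib.sha256(msg + str(nonce).encode()).hexdigest())
# Verification is cheap: recompute one hash and check the leading zeros.
```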

In the email system proposed in 1998 (I would love to use something like this, by the way) senders invest some amount of computation in creating a nonce for each message. Receivers sort or apply thresholds based on the value of the hash. Low numbered hashes represent an investment in the message. Human beings who type or dictate messages to small numbers of recipients won’t even notice the additional compute effort. Mass marketing campaigns will be expensive.

This exact computation is the work of “mining” in the Bitcoin network. The language of “mining” or “finding” bitcoins obscures the fact that we’re actually searching for nonces.

Of course, compute power keeps getting cheaper, so we need to have a flexible system. Fortunately, the tunable difficulty parameter makes this simple. If compute performance on hash functions were governed by Moore’s law (it’s actually a bit more complex), then we would need to increase the strictness of our target by one bit every two years.

The Bitcoin network has been tuning its proof of work to produce valid blocks at a remarkably consistent rate of about one every ten minutes since 2010.

P.S.: Thanks to Eleanor of Diamond Age Data Science for this post explaining the difference between probabilities and likelihoods. An earlier version of this post used the words incorrectly.

The unicorn rant

In biotech these days, I hear a lot of talk about “unicorns.” Sometimes they are rare fancy unicorns … purple, or glittery. At Bio IT World, I found myself moderating a conversation that involved herds and farms of these imaginary animals.

Of course, we were talking about finding and retaining top talent. In the staffing world, “unicorn” is the codeword for an impossibly ideal candidate with a rare mix of skills and experiences. My friends in the recruiting and staffing industries spend their days chasing unicorns. It seems really stressful for them.

Here’s the thing: Unicorns don’t exist.

I’m an engineer by training. I spend a lot of time designing and debugging complex systems. As a rule of thumb, if the plan relies on a continuous supply of something that is either vanishingly rare or (worse) nonexistent – it is a bad plan. When brainstorming, we might joke about knowing a reliable supplier of unobtanium. Sometimes we trot out the old cartoon with the guy saying “and then a miracle occurs.” Eventually, however, engineers sigh and set to work on a better plan.

Not so with many hiring managers, senior leaders, board members, and venture firms in biotech. From what I hear, the plan is to fight harder for the unobtanium, to hope for the miracle.

We need a better plan.

Before going further, I want to first reaffirm my commitment to finding and retaining the best people. Of course people make the difference. Of course we should be highly selective. And yes, of course there are massive, critical differences between candidates. It is a false comparison and a strawman argument to suggest that “making do with a third rate workforce, indiscriminately chosen,” is the only alternative to the unicorn quest.

There are three major pieces to building an organization that does not rely on unicorns:

  • Managers must assume the full time job of supporting and developing their teams.
  • Project plans, workflows, and team behaviors must err on the side of granular, achievable work – with mechanisms to self-correct when the plan is wrong.
  • Recruiting must focus on attitude and enthusiasm, not on finding the next hero.

The non-unicorn plan is straightforward to say, but requires diligent effort and consistency: Divide work into achievable pieces (planning, architecture, and project management are real jobs), hire enthusiastic and intelligent people (give recruiting and HR a fighting chance), and give those people the resources they need (management is a real job).

There’s plenty of literature on this, but you won’t find it in the sci-fi fantasy or the young adult section of the bookstore. Instead, do a quick google on “Hero culture.” You may find yourself reading about burnout, mythical man-months, success catastrophes, and flash-in-the-pan companies.

A more subtle pathology of the unicorn fetish is that it encourages the worst sort of bias and monoculture. When the written criteria are unachievable (unicorn!), then the hiring decision is actually subjective. Rejecting candidate after candidate based on “fit,” or poor interview performance is almost always a warning sign that we’re in bias and blind-spot territory.

As an aside, please recall that interviews are among the worst predictors of job performance.

From the candidate perspective, unicorn recruiting is simple: The best opportunities are only available to the people who have already had the best opportunities (the paper qualifications), and who give favorable first impressions to the hiring manager (bias and cronyism). From what I can see of the startup culture in both Boston and San Francisco, this is in fact the situation. In both cities, we have large populations of motivated people actively seeking work while recruiters work themselves to death. Meanwhile hiring managers make sci-fi/fantasy metaphors to support staffing plans that are based on miracles.

We can do better.

Finally, if none of that convinces you, then perhaps consider the traditional mythology about who, exactly, should be sent to capture a unicorn.

Either way, we’re doing it wrong.