Another day, another data breach.

The Swedish government has apparently exposed personally identifying data on nearly all of its citizens. The dataset came from the ministry of transportation. It included names, photographs, home addresses, birthdates, and other details about citizens – as well as maintenance data on both roads and military and government vehicles. Perhaps most squirm-inducing, the dataset included active duty members of the special forces, fighter pilots, and people living under aliases as part of a witness protection program.

The data has been exposed since at least 2015. We’re just finding out about it now.

I have written in the past about the perils of compiling this sort of dataset. This particular ministry has a good excuse: They print identification cards. The fact that they emailed the information around in clear-text and handed management and storage off to third party processors with little or no diligence? That’s another story.

It provides a decent opportunity to talk about identity and zero knowledge proofs.

Identity is one of those concepts that appears simple from a distance, but that always seems to wriggle out of any rigorous definition.

For today, let’s say that identity is a set of properties associated with a person. We use these properties (or knowledge of them) to verify that someone is who they say they are. We can deal with group identities and pseudonyms in another post. Let’s also agree to defer metaphysics and philosophy around any deeper meaning of the word “identity,” at least for the moment.

My name, birthdate, address, social security number, fingerprints, bank account numbers, current and past addresses, first pet, high school, mother’s maiden name, and so on are all properties attached to and supporting “my” identity. This list includes examples commonly used by banks and websites. When someone calls my bank on the phone and claims to be me, the bank might ask for any or all of the above. As the answers provided by the caller match the ones in the bank’s database, the bank gains confidence that the caller is actually me.

Once a birthday, address, or other similar fact is widely known, it becomes substantially less useful in demonstrating identity. It also becomes substantially easier for people to fake an identity.

This data breach brings a particular problem into stark relief: Our identity cards have all sorts of identifying information printed on them, and that information is available to anybody holding the card (or the database from which it came).

The bartender doesn’t need to know my birthday – they need to know that I am of legal age to buy alcohol. They certainly don’t need to know my address or organ donor status.

This is where zero knowledge proofs come in. A zero knowledge proof is an answer to a question (“is this person of legal drinking age?”) that does not expose any unnecessary information (like date of birth or address) beyond that answer.

In order to implement zero knowledge proofs we usually need a trusted third party who holds the private data and provides the answers. Instead of printing dates of birth on ID cards, we might print a simple barcode. The bartender would scan the barcode with a phone or other mobile app, and receive a “yes” or a “no” answer immediately from the appropriate agency. In some cases, the third party might send me a message letting me know that somebody scanned my ID card. In some cases (like financial transactions), they might even wait for me to validate the request before sending the approval.
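Here is a minimal sketch of that arrangement, mostly to show how little the bartender's app needs to see. Everything in it is hypothetical: the `Agency` class, the card tokens, the requester names, and the legal age of 21 are assumptions made up for illustration, not a description of any real agency's system.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class Agency:
    """Hypothetical trusted third party that holds the private records."""
    legal_age: int = 21  # assumption: the threshold varies by jurisdiction
    _birthdates: dict = field(default_factory=dict)  # card token -> date of birth
    audit_log: list = field(default_factory=list)    # trail of who asked about which card

    def register(self, card_token: str, birthdate: date) -> None:
        self._birthdates[card_token] = birthdate

    def is_of_legal_age(self, card_token: str, requester: str, today: date) -> bool:
        """Answer only yes or no; the birthdate itself never leaves the agency."""
        self.audit_log.append((requester, card_token, today))
        born = self._birthdates.get(card_token)
        if born is None:
            return False  # unknown or invalidated card
        # Subtract one year if this year's birthday has not happened yet.
        had_birthday = (today.month, today.day) >= (born.month, born.day)
        age = today.year - born.year - (0 if had_birthday else 1)
        return age >= self.legal_age


agency = Agency()
agency.register("card-123", date(1990, 6, 15))
answer = agency.is_of_legal_age("card-123", "bartender-7", date(2017, 7, 24))
print(answer)  # True
```

The audit log is what would let the agency tell me that somebody scanned my card, and dropping a token from the registry is all it would take to invalidate a lost one.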

If the third party is trustworthy, having them in the loop can radically increase our information security – both by reducing information leakage and by providing a trail of requests for information. Imagine a driver’s license that did not contain your private information, and could be invalidated as soon as you reported it lost.

Blockchain technologies seem likely to provide a robust solution to the question of a trusted third party in a trust-free environment. More on that in a later post.

The oldest part of Blockchain

Public key encryption, or PKE, is one of the oldest techniques in the blockchain toolbox. PKE dates from the 1970s and has a lineage of being “discovered” by both military and civilian researchers. It’s powerful stuff: One of the early implementations of a PKE system, called “RSA,” was famously classified as a munition and subject to export control by the United States government.

While PKE (also called “asymmetric key”) is a critical technology in Blockchain systems, I care about it mostly because I get a lot of email. With PKE it is conceptually straightforward to encrypt and “sign” a message in such a way that the identity of the sender is publicly verifiable and that the intended receiver is the only one who can open it. I’ll explain why that matters for my INBOX further on in this post.

Most of the algorithms that underpin PKE make use of pairs of numbers – called “keys” – that are related in a particular way. These “key pairs” are used as input to algorithms to encrypt and decrypt messages. A message that has been encrypted with one of the keys in a pair can only be decrypted using the matching key. As with cryptographic hashes, these systems rely on the fact that while it is straightforward to create a pair of keys, it is computationally impractical to guess the second key in a pair given only the first.

This is conceptually distinct from “symmetric” key algorithms, which use the same key for both encryption and decryption.

In one common use of PKE, one half of a key pair is designated as “public,” while the other is “private.” We share the public key widely, posting it on websites and key registries. The private key is closely held. If someone wants to send me a message, they encrypt it using my public key. Since I’m the only one with the private partner to that public key, I’m the only one who can decrypt the message.
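To make the key pair idea concrete, here is a toy sketch using textbook RSA with absurdly small primes. The primes, the exponent, and the helper name `apply_key` are my own choices for illustration. Real systems use keys that are thousands of bits long along with padding schemes, so treat this as a cartoon rather than something to deploy.

```python
# Toy RSA with tiny primes: for illustration only, never for real secrets.
p, q = 61, 53
n = p * q                  # the modulus, shared by both keys
phi = (p - 1) * (q - 1)
e = 17                     # public exponent, chosen to be coprime with phi
d = pow(e, -1, phi)        # private exponent: modular inverse (Python 3.8+)

public_key = (e, n)        # shared widely
private_key = (d, n)       # closely held


def apply_key(value: int, key: tuple) -> int:
    """Raise value to the key's exponent, modulo the key's modulus."""
    exponent, modulus = key
    return pow(value, exponent, modulus)


message = 65                                     # must be smaller than n
ciphertext = apply_key(message, public_key)      # anyone can do this
recovered = apply_key(ciphertext, private_key)   # only the key holder can undo it
print(recovered == message)  # True
```

The same `apply_key` function serves for both directions. The only difference between encrypting and decrypting is which half of the pair goes in.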

Similarly, if the sender wants to “sign” their message, they can encrypt a message using their private key. In this case, only people with access to the public key will be able to decrypt it. This is, of course, not very limiting. Anybody in the world has access to the public key. However, it is still useful, because we know that this particular message was encrypted using the private partner to that public key.

What is particularly cool is that we can “stack” these operations, building them one on top of the other. A very common approach is to encrypt a message twice, first using the sender’s private key to provide verification of their identity, and then a second time using the recipient’s public key, to ensure that only the recipient can open the message.
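The stacking can be sketched with two toy key pairs. As before, this is textbook RSA with tiny primes that I picked for illustration, and `make_keypair` and `apply_key` are hypothetical helper names. I also chose the sender's modulus to be smaller than the recipient's so the signed value fits inside the second operation. Real systems usually sign a hash of the message rather than the message itself.

```python
def make_keypair(p: int, q: int, e: int = 17):
    """Build a toy RSA key pair from two small primes (illustration only)."""
    n = p * q
    phi = (p - 1) * (q - 1)
    d = pow(e, -1, phi)    # modular inverse, Python 3.8+
    return (e, n), (d, n)  # (public key, private key)


def apply_key(value: int, key: tuple) -> int:
    exponent, modulus = key
    return pow(value, exponent, modulus)


sender_public, sender_private = make_keypair(61, 53)        # modulus 3233
recipient_public, recipient_private = make_keypair(89, 97)  # modulus 8633

message = 1234  # must be smaller than the sender's modulus

# Sender: sign with their own private key, then seal with the recipient's public key.
signed = apply_key(message, sender_private)
sealed = apply_key(signed, recipient_public)

# Recipient: open with their own private key, then check with the sender's public key.
unsealed = apply_key(sealed, recipient_private)
verified = apply_key(unsealed, sender_public)
print(verified == message)  # True
```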

Many Blockchain systems use this system to verify that the person (or people, or computer program) authorizing a transaction is in fact allowed to do so. In fact, because key pairs are cheap and plentiful, every single Bitcoin transaction has used a unique pair of keys, created just for that one event.

Back to my surplus of email: None of my banks or healthcare providers have deployed this nearly 40-year-old capability for communicating with me. Instead, a growing fraction of my inbound email consists of notifications that I have a message waiting on some “secure message center.” I am exhorted to click a link and sometimes required to enter my password in order to see the message.

This practice is actively harmful. Fraudulent links in emails are among the primary vectors by which computers are infected with malware. When we teach the absolute basics of information security, “don’t click the link” comes right after “don’t share your password,” but before “we will never ask for your password.”

Email systems that use PKE have been around since I’ve been using technology, and somehow my bank and my hospital haven’t caught on. The HIPAA requirement to use “secure messaging” has driven them backwards, not forwards.

Perhaps if we call it “Blockchain messaging,” it’ll finally catch on.

The blockchain part of Blockchain

The blockchain data structure (which is a part of, but distinct from, the larger Blockchain ecosystem) consists, perhaps unsurprisingly, of an ordered series of “blocks.”

In addition to a payload of data and a few other housekeeping values, each block (except the first one, the “genesis” or “origination” block) contains the hash of the previous block. As described in a previous post, hash values are easy to verify and challenging to fake. A block is valid if it contains the hash of its predecessor. A valid blockchain contains only valid blocks.

A valid blockchain demonstrates an order of events. One cannot create a block without referring to the prior one. If the hashes are correct, we know that the blocks were created in sequential order, and therefore that the data stored on the chain was also written in that order. We know the relative order in which the data was written (we can’t generate a subsequent block without all the prior ones). We also know that the data has not changed since being written (changes to a block will change the hash, and require changes to all future ones).
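A minimal sketch of the data structure, assuming SHA-256 as the hash and a made-up block layout (`index`, `payload`, `previous_hash`) that is mine rather than any particular system's format:

```python
import hashlib
import json


def block_hash(block: dict) -> str:
    """Hash a block's full contents, including its pointer to the prior block."""
    encoded = json.dumps(block, sort_keys=True).encode("utf-8")
    return hashlib.sha256(encoded).hexdigest()


def append_block(chain: list, payload: str) -> None:
    """Add a block that carries the hash of its predecessor (None for the first)."""
    previous = block_hash(chain[-1]) if chain else None
    chain.append({"index": len(chain), "payload": payload, "previous_hash": previous})


def is_valid(chain: list) -> bool:
    """Every block after the first must contain the hash of the block before it."""
    return all(
        chain[i]["previous_hash"] == block_hash(chain[i - 1])
        for i in range(1, len(chain))
    )


chain = []
for payload in ["genesis", "alice pays bob", "bob pays carol"]:
    append_block(chain, payload)

print(is_valid(chain))  # True
chain[1]["payload"] = "alice pays mallory"  # tamper with an earlier block
print(is_valid(chain))  # False
```

Editing the middle block changes its hash, so the block after it no longer points at it. To hide the change, every later block would have to be rewritten too.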

Notably, we get no promises at all concerning validity or security. Merely storing information in a blockchain data structure does not make it correct, complete, or private. In fact, since most Blockchain systems are distributed ledgers (the topic of a future post), information on the chain is somewhat radically public. Every node in most Blockchain networks eventually sees every piece of data on the chain.

Bitcoin and some (but not all) Blockchain systems up the ante on what constitutes a valid block by adding a nonce. The nonce is a value that, added to a block, yields a hash with specific and rare properties. This imposes a cost, called “proof-of-work,” on creating blocks. When creating a new block, authors must try (on average) a large number of nonces until they find one that yields a valid hash. The point of this is to make it computationally challenging merely to create a single new valid block at the end of the chain, and prohibitive to go back and corrupt earlier blocks.
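As a rough sketch of what searching for a nonce looks like for a block: the function below keeps trying nonces until the hash starts with a chosen number of zero hex digits. The block encoding, the `mine` function, and the difficulty of three hex digits are assumptions I made to keep the example fast. Bitcoin's real block format and difficulty rule are more involved.

```python
import hashlib


def mine(previous_hash: str, payload: str, zero_hex_digits: int = 3):
    """Try nonces in order until the block's hash starts with enough zeros."""
    prefix = "0" * zero_hex_digits
    nonce = 0
    while True:
        candidate = f"{previous_hash}|{payload}|{nonce}".encode("utf-8")
        digest = hashlib.sha256(candidate).hexdigest()
        if digest.startswith(prefix):
            return nonce, digest
        nonce += 1


# Three hex digits is 12 bits, so this takes about 4,096 tries on average.
nonce, digest = mine("0" * 64, "alice pays bob")
print(nonce, digest)
```

Because each block's hash feeds into the next one, corrupting an early block means redoing this search for that block and for every block after it.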

The computational work of “mining” in the Bitcoin system is actually just searching for valid nonces. This is sufficiently different from conventional mining that it bears saying: In the usual use of the word “mining,” we are seeking out and refining a valuable resource. In Blockchain systems that use proof of work, the rare and precious resource at hand is the trustworthiness of the system itself. Value is not removed by the mining operation – it is actually being created.

Proof of work and the nonce

The blockchain technology ecosystem brings together a diverse set of codes and algorithms that have been developed over the past 50-ish years. It includes decades-old cryptographic techniques like hashing and symmetric/asymmetric key encryption, and also includes relatively recent innovations related to distributed consensus.

The Blockchain ecosystem reminds me of the classic radio tag-line: It’s the best of the 80’s, 90’s, and today.

Proof of work is one component of that ecosystem. It is used to prevent denial of service attacks, in which large numbers of messages swamp and degrade a system. The system works by imposing a computational cost on the creation of valid messages. Receivers check whether messages are valid before they pay any attention to the contents.

The proof of work described in the original Bitcoin paper is based on a system called Hashcash, which was developed in 1998 to combat spam email. The sender is required to find a value called a nonce that is specific to a particular message, and that demonstrates that they put effort into creating the message. A valid nonce is rare to find by chance, but easy to verify once found.

This property – numbers and relationships that are challenging to find, but trivial to verify – is the basis of most of modern cryptography. Hash functions are one example. A hash function takes arbitrary input and returns a value within a fixed range. In a good cryptographic hash, the result (sometimes simply called the “hash” of the input) is randomly distributed across that range. It is difficult to author an input to get any particular hash value.
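A quick way to see the fixed range and the scattering is to hash a few inputs with SHA-256 from Python's standard library. The inputs here are arbitrary examples of mine.

```python
import hashlib

a = hashlib.sha256(b"hello world").hexdigest()
b = hashlib.sha256(b"hello world!").hexdigest()
big = hashlib.sha256(b"x" * 1_000_000).hexdigest()

# Whatever the input size, the result is 256 bits, written as 64 hex digits.
print(len(a), len(b), len(big))  # 64 64 64

# A one-character change in the input gives an unrelated-looking result.
differing_digits = sum(x != y for x, y in zip(a, b))
print(a)
print(b)
print(differing_digits, "of 64 hex digits differ")
```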

The Hashcash algorithm is simple: The nonce is combined with the message to be sent, and the combination is hashed. The hash result must be small relative to all possible hash results. Exactly how small is a parameter that can be used to tune the algorithm.
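Here is a sketch of that idea in the spirit of Hashcash, not a reimplementation of it. I am assuming SHA-256 and a made-up way of joining the message and nonce, and the function names are mine. The tunable parameter is `zero_bits`: the hash, read as a number, has to fall below 2^(256 - zero_bits).

```python
import hashlib
from itertools import count

HASH_BITS = 256  # SHA-256 output size


def hash_value(message: bytes, nonce: int) -> int:
    """Hash the message together with the nonce and read the result as a number."""
    digest = hashlib.sha256(message + b":" + str(nonce).encode("ascii")).digest()
    return int.from_bytes(digest, "big")


def find_nonce(message: bytes, zero_bits: int) -> int:
    """Sender's side: search for a nonce that makes the hash small enough."""
    target = 1 << (HASH_BITS - zero_bits)
    for nonce in count():
        if hash_value(message, nonce) < target:
            return nonce


def verify(message: bytes, nonce: int, zero_bits: int) -> bool:
    """Receiver's side: a single hash is enough to check the sender's work."""
    return hash_value(message, nonce) < (1 << (HASH_BITS - zero_bits))


message = b"Lunch on Thursday?"
nonce = find_nonce(message, zero_bits=12)  # about 4,096 tries on average
print(nonce, verify(message, nonce, zero_bits=12))
```

The asymmetry is the whole point: `find_nonce` loops thousands of times, while `verify` hashes once.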

For example, if the hash function returns a 256-bit value, there are 2^256 possible results. If we insist the nonce be a value that makes the first 16 of those bits ‘0’, we are insisting that senders find one of 2^240 values from among 2^256 possible hash results. The probability of this happening by random chance is one in 2^16, or 1 in 65,536.

On average (assuming that we have picked a good hash function) senders will have to try 2^16 nonces before finding a valid one. If we assume that each hash takes 1 second to calculate on a single CPU, the sender would invest (on average) slightly under a CPU day (about 18 hours) per message.
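The arithmetic is easy to check. The one-second-per-hash figure is the deliberately slow assumption from the paragraph above, not a measurement.

```python
hash_bits = 256
zero_bits = 16

total_results = 2 ** hash_bits
valid_results = 2 ** (hash_bits - zero_bits)   # hashes whose first 16 bits are 0

# Chance that one random hash is valid, expressed as "one in N".
one_in = total_results // valid_results
print(one_in)  # 65536

# With success probability 1/N per try, the average number of tries is N.
seconds_per_hash = 1  # assumption from the text
expected_hours = one_in * seconds_per_hash / 3600
print(round(expected_hours, 1))  # 18.2
```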

In the email system proposed in 1998 (I would love to use something like this, by the way) senders invest some amount of computation in creating a nonce for each message. Receivers sort or apply thresholds based on the value of the hash. Low numbered hashes represent an investment in the message. Human beings who type or dictate messages to small numbers of recipients won’t even notice the additional compute effort. Mass marketing campaigns will be expensive.

This exact computation is the work of “mining” in the Bitcoin network. The language of “mining” or “finding” bitcoins obscures the fact that we’re actually searching for nonces.

Of course, compute power keeps getting cheaper, so we need to have a flexible system. Fortunately, the tunable parameter of the nonce makes this simple. If compute performance on hash functions were governed by Moore’s law (it’s actually a bit more complex), then we would need to increase the strictness of our nonce by one bit every two years.
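A small sketch of why one bit every two years does the job, under the simplifying assumption that hashing speed doubles exactly every two years. The starting point of 16 bits at one hash per second is carried over from the example above, and the function name is mine.

```python
def expected_seconds(zero_bits: int, hashes_per_second: float) -> float:
    """Average time to find a valid nonce: 2^zero_bits tries at the given rate."""
    return 2 ** zero_bits / hashes_per_second


base_bits, base_rate = 16, 1.0

results = []
for years in (0, 2, 4, 10):
    doublings = years // 2                 # assumption: speed doubles every two years
    rate = base_rate * 2 ** doublings      # hardware gets faster...
    bits = base_bits + doublings           # ...so we demand one more zero bit per doubling
    results.append(expected_seconds(bits, rate))
    print(years, bits, results[-1])
```

Each extra bit doubles the expected number of tries, which exactly cancels a doubling in speed, so the cost per message stays put.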

The Bitcoin network has been tuning its proof of work to produce valid blocks at a remarkably consistent rate of about one every ten minutes since 2010.

P.S.: Thanks to Eleanor of Diamond Age Data Science for this post explaining the difference between probabilities and likelihoods. An earlier version of this post used the words incorrectly.


Over the summer, I have the opportunity to think deeply about the ecosystem of technologies that go by the name “Blockchain.” I’m focusing particularly on how these might apply in a couple of different scientific and healthcare contexts. I plan to post snippets here from time to time, as much to force me to clarify my thinking as anything else. As Hyman Rickover said, "Nothing so sharpens the thought process as writing down one's arguments. Weaknesses overlooked in oral discussion become painfully obvious on the written page."

One challenge when trying to talk about blockchain is that it is massively hyped – sitting right at the peak of Gartner’s hype cycle. Most of the meetings I go to these days include at least one person who asks, regardless of the topic at hand, “what about blockchain?”

Another challenge is that Blockchain is strongly associated with Bitcoin and other cryptocurrencies. The hype brings a certain breathiness to the conversation, while the finance connection brings associations with fraud and nefarious dealings. Neither of these is entirely merited – but I’m finding it important to keep in mind as I explore.

For all that, the foundational documents are remarkably crisp, lucid, and readable. The original bitcoin paper, written under the pseudonym “Satoshi Nakamoto,” is only eight pages long – plus a half page of references.

It’s clear to me that there’s important work to be done in this space – and I’m thrilled to have the time to take part.