Identity

Another day, another data breach.

The Swedish government has apparently exposed personal identifying data on nearly all of their citizens. The dataset came from the ministry of transportation. It included names, photographs, home addresses, birthdates, and other details about citizens – as well as maintenance data on both roads and military and government vehicles. Perhaps most squirm-inducing, the dataset included active duty members of the special forces, fighter pilots, and people living under aliases as part of a witness protection program.

The data has been exposed since at least 2015. We’re just finding out about it now.

I have written in the past about the perils of compiling this sort of dataset. This particular ministry has a good excuse: They print identification cards. The fact that they emailed the information around in clear-text and handed management and storage off to third party processors with little or no diligence? That’s another story.

It provides a decent opportunity to talk about identity and zero knowledge proofs.

Identity is one of those concepts that appears simple from a distance, but that aways seems to wriggle out of any rigorous definition.

For today, let’s say that identity is a set of properties associated with a person. We use these properties (or knowledge of them) to verify that someone is who they say they are. We can deal with group identities and pseudonyms in another post. Let’s also agree to defer metaphysics and philosophy around any deeper meaning of the word “identity,” at least for the moment.

My name, birthdate, address, social security number, fingerprints, bank account numbers, current and past addresses, first pet, high school, mother’s maiden name, and so on are all properties attached to and supporting “my” identity. This list includes examples commonly used by banks and websites. When someone calls my bank on the phone and claims to be me, the bank might ask for any or all of the above. As the answers provided by the caller match the ones in the bank’s database, the bank gains confidence that the caller is actually me.

Once a birthday, address, or other similar fact is widely known, it becomes substantially less useful in demonstrating identity. It also becomes substantially easier for people to fake an identity.

This data breach brings a particular problem into stark relief: Our identity cards have all sorts of identifying information printed on them, and that information is available to anybody holding the card (or the database from which it came).

The bartender doesn’t need to know my birthday – they need to know that I am of legal age to buy alcohol. They certainly don’t need to know my address or organ donor status.

This is where zero knowledge proofs come in. A zero knowledge proof is an answer to a question (“is this person of legal drinking age?”) that does not expose any unnecessary information (like date of birth or address) beyond that answer.

In order to implement zero knowledge proofs we usually need a trusted third party who holds the private data and provides the answers. Instead of printing dates of birth on ID cards, we might print a simple barcode. The bartender would scan the barcode with a phone or other mobile app, and receive a “yes” or a “no” answer immediately from the appropriate agency. In some cases, the third party might send me a message letting me know that somebody scanned my ID card. In some cases (like financial transactions), they might even wait for me to validate the request before sending the approval.

If the third party is trustworthy, having them in the loop can radically increase our information security – both by reducing information leakage and by providing a trail of requests for information. Imagine a drivers license that did not contain your private information, and could be invalidated as soon as you reported it lost.

Blockchain technologies seem likely to provide a robust solution to the question of a trusted third party in a trust-free environment. More on that in a later post.

The oldest part of Blockchain

Public key encryption, or PKE, is one of the oldest techniques in the blockchain toolbox. PKE dates from the 1970s and has a lineage of being “discovered” by both military and civilian researchers. It’s powerful stuff: One of the early implementations of a PKE system, called “RSA,” was famously classified as a munition and subject to export control by the United States government.

While PKE (also called “asymmetric key”) is a critical technology in Blockchain systems, I care about it mostly because I get a lot of email. With PKE it is conceptually straightforward to encrypt and “sign” a message in such a way that the identity of the sender is publicly verifiable and that the intended receiver is the only one who can open it. I’ll explain why that matters for my INBOX further on in this post.

Most of the algorithms that underpin PKE make use of pairs of numbers – called “keys” – that are related in a particular way. These “key pairs” are used as input to algorithms to encrypt and decrypt messages. A message that has been encrypted with one of the keys in a pair can only be decrypted using the matching key. As with crytographic hashes, these systems rely on the fact that while it is straightforward to create a pair of keys, it is computationally impractical to guess the second key in a pair given only the first.

This is conceptually distinct from “symmetric” key algorithms, which use the same key for both encryption and decryption.

In one common use of PKE, one half of a key pair is designated as “public,” while the other is “private.” We share the public key widely, posting it on websites and key registries. The private key is closely held. If someone wants to send me a message, they encrypt it using my public key. Since I’m the only one with the private partner to that public key, I’m the only one who can decrypt the message.

Similarly, if the sender wants to “sign” their message, they can encrypt a message using their private key. In this case, only people with access to the public key will be able to decrypt it. This is, of course, not very limiting. Anybody in the world has access to the public key. However, it is still useful, because we know that this particular message was encrypted using the private partner to that public key.

What is particularly cool is that we can “stack” these operations, building them one on top of the other. A very common approach is to encrypt a message twice, first using the sender’s private key to provide verification of their identity, and then a second time using the recipient’s public key, to ensure that only the recipient can open the message.

Many Blockchain systems use this system to verify that the person (or people, or computer program) authorizing a transaction is in fact allowed to do so. In fact, because key pairs are cheap and plentiful, every single Bitcoin transaction has used a unique pair of keys, created just for that one event.

Back to my surplus of email: None of my banks or healthcare providers have deployed this nearly 40 year old capability for communicating with me. Instead, a growing fraction of my inbound email consists of notifications that I have a message waiting on some “secure message center.” I am exhorted to click a link and sometimes required to enter my password in order to see the message.

This practice is actively harmful. Fraudulent links in emails are among the primary vectors by which computers are infected with malware. When we teach the absolute basics of information security, “don’t click the link,” comes right after “don’t share your password,” but before “we will never ask for your password.”

Email systems that use PKE have been around since I’ve been using technology, and somehow my bank and my hospital haven’t caught on. The HIPAA requirement to use “secure messaging,” has driven them backwards, not forwards.

Perhaps if we call it “Blockchain messaging,” it’ll finally catch on.

The answer is “Hybrid”

“Hybrid,” is the answer.

I’m talking about “on-prem vs cloud,” the bugaboo trick question that has plagued us for nearly a decade.

What I mean by reframing the question like this is that – absent other details – the location and ownership of servers is much less important than the architecture of the solutions.

So let’s get on with it. This is more than just philosophy. We are engineers – we thrive on specifics:

A colleague of mine likes to say, “ssh is cheating.” What he means is that it is unacceptable, in 2017, to leave your system so incomplete that it requires a human to log in and perform manual configuration in the event of an update or (heavens forfend) a reboot. Ssh as a protocol is fine (most of the tools below run over that protocol), it’s the manual intervention that constitutes cheating.

System configurations should be defined in software. It doesn’t really matter which domain specific language we pick, any of Puppet, Chef, or Ansible will work fine. It is important to pick one and get good at it, rather than trying to maintain a polyglot mashup.

Configuration scripts need to be under version control, just like any other software. Github is the default code repository these days, though there are reasons to go with something slightly more pre-integrated like Bitbucket.

Configuration changes should follow a structured process like Gitflow. It is worth noting that this tool is only helpful if coupled with human processes of communication and trust. Human beings need to check and validate each other’s work, to avoid overwriting or colliding with each other’s changes.

Once a change is checked in and reviewed, continuous integration, test, and deployment is the order of the day. Tools like Jenkins remove all of the manual interventions between approving a change and seeing the code built, tested, and pushed to production on whatever schedule the team has picked. Note that this is not an argument for the wild west. For most teams, most of the time, I’m an advocate of “read only Fridays,” since changes on a Friday frequently lead to long weekends at the keyboard.

All of this is just as true in the on-premises data center as it is in Amazon’s east-coast-2 availability zone. You don’t get to ignore modern systems engineering practices just because finance negotiated a really killer deal with Dell.

So when someone asks me, “on-prem vs cloud,” without further elaboration, I say “hybrid.” It’s an answer that allows me to get on with building robust, scalable, agile systems no matter who happens win out as the infrastructure provider.