Earlier this month, an information security firm found a multi-terabyte dataset of personal information on at least 198 million American voters sitting unsecured in a world-readable S3 bucket. They did the responsible thing, notified the owners, and then wrote a very accessible description of the situation.
It serves as a decent cautionary tale and metaphor for some of the privacy concerns we face in health care, life sciences, and genomic medicine.
This post is about blame.
Could we blame the coder? The specific mistake that led to the data exposure was in their continuous integration and deployment workflow. A code change had the unintended effect of disabling access controls on the bucket. While the person who checked in that code change certainly made a mistake, it was far from the root cause of the failure. We would be remiss (but in good company) to blame the coder.
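To make the point concrete, here is a minimal sketch of the kind of guard rail a deployment pipeline could include – a post-deploy check that fails the build if a bucket has become world readable. It assumes the boto3 SDK and a made-up bucket name; it is an illustration, not the firm’s actual workflow.

```python
"""Hypothetical post-deploy check: fail the pipeline if the bucket's ACL
grants access to the public. Bucket name and exit behavior are assumptions."""
import sys

import boto3

# Canonical URIs that mark an ACL grant as public.
PUBLIC_GRANTEES = {
    "http://acs.amazonaws.com/groups/global/AllUsers",
    "http://acs.amazonaws.com/groups/global/AuthenticatedUsers",
}


def bucket_is_public(bucket: str) -> bool:
    """Return True if any ACL grant on the bucket targets a public group."""
    acl = boto3.client("s3").get_bucket_acl(Bucket=bucket)
    return any(
        grant.get("Grantee", {}).get("URI") in PUBLIC_GRANTEES
        for grant in acl["Grants"]
    )


if __name__ == "__main__":
    if bucket_is_public("example-voter-data"):  # hypothetical bucket name
        sys.exit("Deployment check failed: bucket is publicly readable")
```

A check like this costs a few minutes to write, which is part of why the root cause lies with the process, not the person who pushed the change.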
Could we blame the cloud provider? I say “absolutely not.” While this sort of exposure is more common with public clouds, it would be radically incorrect to place the blame on the hosting company. Amazon provides robust tools and policies to protect their customers from exactly this sort of mistake. In the health care / life sciences space, they offer a locked-down configuration of their services, and they require customers to use it for applications involving HIPAA data. These controls can be imposed at a contract level, meaning that business owners – even those who are not cloud-savvy – have every opportunity to protect their data.
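For a sense of how little effort those guard rails require, here is a minimal sketch of locking a single bucket down with boto3. The bucket name is hypothetical, and in practice these settings are often applied account-wide or enforced by policy rather than ad hoc – but this is representative of the kind of protection on offer.

```python
"""Minimal sketch: block all public access to an S3 bucket with boto3."""
import boto3

s3 = boto3.client("s3")
BUCKET = "example-voter-data"  # hypothetical bucket name

# Block every avenue of public exposure: public ACLs, public bucket
# policies, and access via policies that would otherwise make the
# bucket readable by anyone.
s3.put_public_access_block(
    Bucket=BUCKET,
    PublicAccessBlockConfiguration={
        "BlockPublicAcls": True,
        "IgnorePublicAcls": True,
        "BlockPublicPolicy": True,
        "RestrictPublicBuckets": True,
    },
)

# Read the configuration back to confirm the lock is actually in place.
status = s3.get_public_access_block(Bucket=BUCKET)
print(status["PublicAccessBlockConfiguration"])
```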
The owners of the bucket chose not to employ Amazon’s guard rails – despite knowing that they were amassing an extraordinary trove of sensitive and private data on nearly every American.
Could we blame the information security firm? While it is not uncommon to blame the person who finds the door unlocked rather than the one who failed to lock it, I say “no.”
Could we at least blame the firm that owned the bucket as a whole? The answer is certainly “yes,” as with the coder above – but it would be a mistake to stop there. This should be an extinction-level event for the organization responsible, and with good reason. Still, I think it would be a shame to stop before reaching the root cause.
Responsibility rests with the people who created the dataset. This is true whether we’re talking about genomes, medical records, consumer / social media trails, or anything else. Much of the data in that set came from public sources. Still, we all know that the power of data grows geometrically when it is combined with other data. When you do the work of aggregating, cleaning, and normalizing diverse datasets, it is your responsibility to be aware of the privacy and appropriate-usage implications.
This imposes an ethical burden on data scientists. We cannot just blame the cloud provider, the coder, the business leaders, or whoever else. If you make a dataset that has the potential for this scale of privacy violation, you have a responsibility to make sure that it is appropriately handled. Beyond any technical controls, you have a responsibility to be sure that it is appropriately used. This responsibility transfers: If you hire a team to do things like this, you have a responsibility to be sure they do it in an ethical and effective way.
I’m far too jaded to believe that legal culpability will reach much beyond the coder – but it should.