It’s summertime – season of thunderstorms. Most days are punctuated with ominous clouds and distant thunder. Actual rain, however, is rare. The forecast is consistent – temperatures may spike to uncomfortably hot in the afternoon, and there are low odds of a thunderstorm. I carry an umbrella all day, and then water my garden by hand.
It reminds me of our industry-wide set-piece about how genomic data is so terribly huge (and growing so incredibly fast!) that it’s going to overwhelm everything.
We’ve been living in the shadow of a tidal wave of data for more than 10 years. Honestly, it’s a little awkward that we’re still sounding the alarm.
The first time the phrase “data tsunami” appeared in my slides was in a presentation from 2007. That was when the first wave of so-called “next-gen” DNA sequencing instruments was really coming into its own. Those instruments increased the velocity of DNA sequencing by around three orders of magnitude. They also reduced the per-base cost of sequencing by an independent three orders of magnitude. Taken together – a thousandfold in speed and a thousandfold in cost – that amounted to roughly a millionfold increase in the rate of data production.
We observed at the time that this rate increase outpaced Moore’s Law. Now, as genomic diagnostics and precision/personalized medicine finally make their way into the clinic, we’re making the same observation again. While it’s flattering to hear brag words like “genomical,” the alarmism is also a bit misleading.
Because you know what? We kept up before, and we’ll keep up now. I think that we’re actually better prepared for this decade’s data deluge than we were for the last one.
Sure, there were blood, sweat, and tears – that’s the job of engineering. We changed and adapted untenable practices – including choosing to discard the raw image output from the high-resolution cameras on the new sequencers. Instead we stored only the information that was actually useful to the scientists – at the time, the base calls and quality scores for all the reads. That idea was a fight at the beginning. I recall hours of conversation with scientists incredulous that I would suggest that any data could ever be deleted. Today, you can’t even get the raw images off the sequencers.
We upgraded the infrastructure of biology facilities for the genomic age. We planned and built high performance network connections all the way out to laboratories. We consolidated data-producing instruments into “cores,” provisioned with infrastructure to handle the network and data storage load. We shifted servers and storage out of aging lab buildings and into co-located data centers. We combined independent compute farms into time-shares on integrated high performance computing environments. We worked out cost recovery schemes to make sure that it was sustainable. As public and private clouds have matured, we’ve continued to evolve, and I’m sure that we will continue to do so.
We also upgraded our human relationships. We forged partnerships with the technologists who build data storage, network, and computing systems. Together, we adapted the tools and techniques already in use in media and entertainment, finance, and other industries to be better fits for the challenges of science. We sent computer science students to biology journal clubs, and vice versa, and eventually recognized “bioinformatics” and “computational biology” as important specializations in their own right.
We have a decade of trust, education, and mutually beneficial work to build on.
So while it is certainly flattering to hear people proclaim that “genomical” is a better adjective than “astronomical” to describe rapid data growth, I’m not convinced that it’s cause for anything other than enthusiasm. A decade ago it was terabytes of genomic sequence data for research. Now it’s petabytes, or even exabytes, of patient records for precision medicine and genomic diagnostics.
We’re gonna be fine, people. Sure, carry an umbrella, but think of it as “rainbow weather.”