Plenty To Do

One of the things that we’ve got going for us as technologists is that the underlying reality of biology changes pretty slowly. The human genome has been the same size for at least the last 10,000 years, and maybe as long as 300,000 years depending on who you ask. This means that taking 30 measurements of each of our 3.2 billion base pairs at (give or take) 10 bits per base yields the same ~100 gigabytes (GB) of raw files as it did back at the beginning of high throughput sequencing.
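
For anyone who wants to check that arithmetic, here is a back-of-the-envelope sketch in Python. The three inputs are the figures from the paragraph above; the function name and the choice of decimal gigabytes (10^9 bytes) are assumptions made for illustration, and the "10 bits per base" figure is the rough estimate from the text, not a property of any particular file format.

```python
# Rough size of the raw data for one human genome, using the figures in the text.
GENOME_BASES = 3.2e9   # base pairs in the human genome
COVERAGE = 30          # measurements taken of each base
BITS_PER_BASE = 10     # "give or take", per the text


def raw_genome_bytes(bases=GENOME_BASES, coverage=COVERAGE, bits_per_base=BITS_PER_BASE):
    """Return the estimated raw data size in bytes (hypothetical helper)."""
    total_bits = bases * coverage * bits_per_base
    return total_bits / 8  # 8 bits per byte


size_gb = raw_genome_bytes() / 1e9  # decimal gigabytes (assumption)
print(f"{size_gb:.0f} GB per genome")
```

With these inputs the estimate lands a little above the ~100 GB quoted, which is as close as "give or take" needs to be.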

Back in the early 2000s, 100 GB was a lot of data – more than the storage capacity of most workstations. Size limits on 32-bit filesystems could quietly truncate files, and a full gigabit connection to the internet was an expensive proposition. Merely storing and accessing the files correctly, much less getting any science done, was a solid day’s work for the journeyman technologist. These days cell phones and USB sticks hold terabytes and nobody bats an eye at using the cloud – at least not for technical reasons.

All of which gives little comfort because scientific ambition expands to fill the space created by technology.

[ME, CHEERFUL] Hey, yo! I’ve built that system you said you needed. The one to host and analyze a hundred thousand human genomes? I’m gonna take the win and cut out for a long weekend.
[BIOLOGIST, ENRAPTURED] What if we sequenced every single cell in a sample? There are thousands of cells per sample! We could use your whole system on just a couple of samples!
[ME, PALM ON FACE] That sounds great. That’s perfect.

Back in the day we measured gene expression using the ratio of intensities from competing varieties of fluorescent molecules crammed into literal divots on a plastic “microarray.” At the end of the day we just needed to store a couple of floating point numbers for each of, perhaps, a million locations of genomic interest. Even complex time-series experiments would generate a few megabytes of raw data. The metadata describing how the experiment had been conducted would often occupy more space on the disk (and more time for the analyst) than the data itself.

Then some (literal) genius was like “what if we used the DNA sequencer for that? We could -count- transcripts directly instead of relying on ratios” and the person with the single cell system chimed in with “oh have I got an idea for you.”

I was chatting with a group recently whose present-generation instrument generates 5 terabytes (TB) over the course of a week or so. It’s cool science, capable of imaging both transcripts and proteins with sub-cellular resolution. It’s quite likely to reveal some fundamental stuff that will require us to once again update -all- the textbooks and retrain -all- the large language models.

They’re keeping up pretty well for the time being, but the next revision of their platform will increase data volumes fivefold while also reducing runtimes to a little over a day. It’s those geometric accelerations that we technologists need to watch out for. The thing where genomics accelerated faster than Moore’s law for several years was the result of innovations that made it simultaneously 1,000 times faster and also 1,000 times cheaper to sequence DNA.
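
To put rough numbers on that jump, here is a small sketch that converts "data per run" into the sustained rate you would need just to move the data off the instrument as fast as it is produced. The 7-day and 1.25-day run lengths are assumptions standing in for "a week or so" and "a little over a day," and the helper name is hypothetical. Real instruments write in bursts, so treat this as a lower bound on what the storage and network have to handle.

```python
# Sustained data rate implied by an instrument's output per run.
TB = 1e12                 # decimal terabyte, in bytes (assumption)
SECONDS_PER_DAY = 86_400


def sustained_gbit_per_s(terabytes, days):
    """Average rate in gigabits per second needed to keep pace with one run."""
    total_bits = terabytes * TB * 8
    return total_bits / (days * SECONDS_PER_DAY) / 1e9


# Assumed run lengths: 7 days now, 1.25 days for the next revision.
current = sustained_gbit_per_s(5, 7)        # 5 TB over about a week
next_gen = sustained_gbit_per_s(5 * 5, 1.25)  # fivefold data, a little over a day

print(f"current:  {current:.2f} Gbit/s")
print(f"next gen: {next_gen:.2f} Gbit/s")
print(f"speed-up: {next_gen / current:.0f}x")
```

Under those assumptions the fivefold increase in volume turns into something closer to a thirtyfold increase in sustained throughput, because the data shows up in a fraction of the time. That is more than a single gigabit link can carry, for one instrument.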

[BIOLOGIST] What if there was stuff outside the germ line? Some sort of META-genome? Could we sequence that too?
[LAB] 100%
[CLINICAL] I bet we could measure circulating tumor and/or fetal DNA if we just kept on sequencing deeper and deeper.
[LAB] On it.
[GENE EDITING PEOPLE] Everybody, everybody, check this out!
[ME] I had a data model. It was a nice data model.

Not that I’m complaining. This is good, interesting, meaningful work. I’m glad to have lucked into an industry with enough technology challenges to keep me and everybody I know busy for our entire careers – provided that the money people can figure out how to get the industry working again.

I’m confident that future generations of technologists will have ample opportunity to place palm to forehead, breathe deeply, and say “oh wow that’s cool. Yes, of course I can support that, I just don’t know how yet.”
