DeepVariant

Earlier this week, Google published DeepVariant, a machine learning (ML) based variant caller for genomic sequencing data. The software is now available on the DNAnexus platform.

This is kind of a big deal, and also kind of not a big deal.

Does it matter?

It’s a big deal in the same way that ML systems exceeding the performance of radiologists on diagnostic classification of images is a big deal. Sure, it’s a little creepy and intimidating when a computer program exceeds a respected and trained professional at one of their tasks. On the other hand, it would take a spectacularly naive and arrogant person to claim that a radiologist’s only job is image classification.

It’s not a big deal because there is still so much domain expertise required to derive scientifically meaningful results from genomic data, and because these methods are still changing all the time.

The DeepVariant team picked one of the points in the genomic analysis workflow where scientists have historically used eyeballs and intuition to identify subtle patterns in the data. Prior variant callers were built atop that intuition, coding it into complex algorithms. That’s why there was a well-characterized image format (the pileup) already available as a starting point for the project – scientists still want to look at the results of their callers to see if the results align with intuition.
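For readers who have not stared at one, here is a minimal sketch of what a pileup shows. It is a toy text rendering, not DeepVariant's actual input encoding: the function names (`text_pileup`, `column_counts`), the reference string, and the reads are all made up for illustration, and real reads also carry base qualities, strand, and indels.

```python
from collections import Counter

def text_pileup(reference, reads):
    """Render reads against a reference, one row per read.

    reads is a list of (start, sequence) pairs with 0-based start offsets.
    Matching bases are drawn as '.', mismatches as the read's own base,
    and positions the read does not cover as ' '.
    """
    rows = []
    for start, seq in reads:
        row = [" "] * len(reference)
        for offset, base in enumerate(seq):
            pos = start + offset
            if 0 <= pos < len(reference):
                row[pos] = "." if base == reference[pos] else base
        rows.append("".join(row))
    return rows

def column_counts(reads, pos):
    """Count the bases that the reads report at one reference position."""
    counts = Counter()
    for start, seq in reads:
        offset = pos - start
        if 0 <= offset < len(seq):
            counts[seq[offset]] += 1
    return counts

# Hypothetical data: three of the four reads disagree with the reference at position 4.
reference = "ACGTACGTAC"
reads = [(0, "ACGTAC"), (2, "GTTCGT"), (3, "TTCGTAC"), (4, "TCGTAC")]

pileup = text_pileup(reference, reads)
print(reference)
for row in pileup:
    print(row)
print(dict(column_counts(reads, 4)))
```

The column of T's under the reference A is the kind of pattern a person spots at a glance, and the kind of pattern both hand-written callers and a trained network have to judge as either real variation or noise.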

That’s also why there was a contest for the team to win: we’re still figuring this stuff out.

It was a good place to start, and the system performed much as we might expect.

Much to Learn

I saw a preview of this technology at the Broad Institute sometime in mid-to-late 2016. We were all really impressed. I remember that someone asked exactly the right question: “Can it -discover- a new sort of biological artifact or feature? One that we haven’t seen before?”

The team was unambiguous: Of course it can’t. Until the patterns are present in the training data, there’s nothing there to learn. Further, this particular approach will never suggest that, maybe, we’re looking at the problem sideways.

Put another way: There is a lot of genomic biology still to be learned.

Every year that I’ve been in and around this field, there has been at least one discovery that has up-ended major pieces of institutional knowledge and dogma. Formerly safe assumptions get melted down, combined with new insights, and formed into superior alloys all the time.

The more subtle challenge

There is a more subtle challenge in this particular case: We’re dealing with measurements rather than facts here. The process of DNA sequencing is complex and subtle, with biases and omissions and room for improvement throughout. The way that this particular test was framed up assumes that there is one unambiguous correct answer to the question of variation, and that we already know that answer.

A genomic biologist – or scientist of any stripe – has to hold two truths in their head at the same time: They must gather data to answer questions, and they must also accept that the data may suggest refinements to the question itself. Those refinements to the question, the ones that call existing knowledge into question – that’s where the real innovation happens.

Given enough data, machine learning now excels at answering well-formed questions. The task of questioning our assumptions and changing the question itself remains much more subtle.

The take home

The short version is that computers are here, right now, to take away any part of any job that involves memorizing a large corpus of data and then identifying new examples of old categories based on patterns in that data. This is just as true for eyeballing pileup images as it is for reading ZIP codes or license plates.
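The "memorize a corpus, then sort new examples into old categories" pattern can be sketched with a nearest-centroid classifier. This is a deliberately tiny stand-in, not how DeepVariant works (that is a deep neural network trained on pileup images). The two features, the labels, and every number below are invented for illustration.

```python
import math

def train(examples):
    """Compute one centroid per label from (features, label) pairs."""
    sums, counts = {}, {}
    for features, label in examples:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, value in enumerate(features):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}

def classify(centroids, features):
    """Return the label whose centroid is closest to the input."""
    return min(centroids, key=lambda label: math.dist(centroids[label], features))

# Hypothetical training data: (alt-allele fraction, normalized base quality).
examples = [
    ((0.00, 0.9), "hom_ref"), ((0.04, 0.8), "hom_ref"),
    ((0.50, 0.9), "het"),     ((0.44, 0.8), "het"),
    ((1.00, 0.9), "hom_alt"), ((0.96, 0.8), "hom_alt"),
]
centroids = train(examples)

typical = classify(centroids, (0.52, 0.88))  # looks like the training data
strange = classify(centroids, (0.30, 0.10))  # looks like nothing in the training data

print(typical)
print(strange)
```

Both calls return one of the three labels seen during training. The second input resembles none of the training examples, yet the classifier still files it under an old category, because it has no way to say "this is something new."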

Machine learning is also here for any part of your job in which you merely turn the crank on a bunch of rules and formulas. This has already impacted a bunch of different jobs: Tax preparation, law, real estate, and travel planning have all undergone radical changes in the last decade.

One final thought: This is also a big deal because while it takes massive computation to create a recognizer like DeepVariant, it is trivial to use that recognizer on any particular input. Variant calling in the old model takes up a lot of CPU power – which can now be turned (hopefully) to more subtle questions.



2 thoughts on “DeepVariant”

  • It is interesting… still looks quite computationally intensive but suspect this will improve with newer chips. I assume they are working on calling somatic variants for tumor samples?

  • The interesting thing to me is the use of images. That’s a hack so that they can recycle known architectures. It suggests that changing the font or color scheme would alter (and probably destroy) the ability to call correctly. Surely there is some more direct encoding that only captures the truly relevant features, and allows better efficiency. It’s just waiting to be found, and with no human input you could begin to explore this space. The only question is whether the improvements to runtime and accuracy would be closer to 0.1% or 10%, 10x or E10.
