News: Google announces DeepVariant
3
8
6.9 years ago
Hussain Ather ▴ 990

Google announced the release of DeepVariant, a deep learning tool for constructing true genome sequences with greater accuracy than classical methods. It only works on somatic calls, but it is very interesting to see image recognition applied to genome reconstruction.

DeepVariant is the first of what we hope will be many contributions that leverage Google's computing infrastructure and ML expertise to both better understand the genome and to provide deep learning-based genomics tools to the community.

genome deep-learning google • 6.2k views
5

One (more) step towards "Ok Google .. analyze this dataset, predict the downstream consequences".

1

Haha, sounds familiar, but I was expecting this from Google. It's finally out and, to be honest, it seems pretty impressive, with open source availability as well.

1

We implemented the DeepVariant pipeline with Docker and Nextflow here

Lifebit has integrated the pipeline with example parameters

Would love your feedback on this.

Thanks!

13
6.9 years ago

Some thoughts:

1) DeepVariant does not work on somatic calls - only germline.

2) Yes, it beat GATK, but only barely (don't quote me on the numbers, but it was something like 98% vs 98.5%)

3) The method is insane, in that they actually create millions of images, encoding read information as colors and alpha, and then use their image-processing neural network to do pattern recognition for calling.

4) It is quite computationally expensive to run, not to mention the cost of training the NN.

5) It absolutely requires new training data for each platform that you're going to run it on. Chemistry changed slightly? Got a new type of instrument? Doing targeted regions instead of WGS? You'll need a new gold standard run and you'll need to retrain the algorithm from scratch. They used the Genome in a Bottle dataset. That's limited to ~80% of the genome, and their TPs are only calls validated on at least two sequencing technologies.

Don't get me wrong - it's cool to see someone enter the space with a really crazy orthogonal method, but it's not a panacea, and the hype about AI solving all of our variant calling problems is pretty clearly overblown. That doesn't mean that this won't be useful in the future, just that it's not there yet.

2

Just to clarify, DeepVariant does not use images. Their first implementation was based on Inception, and it used images.

But now DeepVariant doesn't use images; it uses tensor representations of the genome data instead.

1

Thanks, Chris, for such a concise and informative review. I have a question/comment about your point 5, on the need to retrain the model every time something changes (bear with me: I haven't read the DeepVariant method in any detail).

First, I wonder to what extent it is really necessary to retrain the parameters even for small changes in the library preparation. Presumably (big if), small changes in, say, chemistry should still give good results.

But most importantly, I don't think DeepVariant is conceptually different from other methods when it comes to using training and test data. I mean, DeepVariant makes the need for training data explicit. But implicitly other methods also need training data that in theory should be re-analysed every time something changes. For example, when we ("we" meaning us or the program we use) decide to filter out variants supported by fewer than 3 reads, effectively we are saying "given the training data I've seen until now, 3 is a good threshold".
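The read-support threshold example can be written out as a tiny "training" step: given calls labelled true or false against some truth set, pick the cutoff that scores best. This is only a sketch; the labelled data, the function names, and the use of F1 as the score are assumptions for illustration, not how any particular caller chooses its defaults.

```python
# Hypothetical example: choosing a minimum alt-read threshold from labelled
# calls, i.e. making the "implicit training" of a hard filter explicit.


def f1_at_threshold(labelled_calls, min_alt_reads):
    """F1 score if we keep only calls with at least `min_alt_reads` alt reads.

    `labelled_calls` is a list of (alt_reads, is_true_variant) pairs.
    """
    tp = fp = fn = 0
    for alt_reads, is_true in labelled_calls:
        kept = alt_reads >= min_alt_reads
        if kept and is_true:
            tp += 1
        elif kept:
            fp += 1
        elif is_true:
            fn += 1  # a real variant that the filter threw away
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def pick_threshold(labelled_calls, candidates):
    """Return the candidate threshold with the highest F1."""
    return max(candidates, key=lambda t: f1_at_threshold(labelled_calls, t))


# Made-up labelled calls: (supporting alt reads, validated as true?)
labelled_calls = [
    (1, False), (2, False), (2, False), (2, True),
    (3, True), (4, True), (8, True),
]
best_threshold = pick_threshold(labelled_calls, range(1, 6))
```

If the chemistry or platform shifts the error profile, the same sweep on new labelled data could land on a different cutoff, which is the point being made above.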

2

My assumption is that whether you run GATK on data produced on a HiSeq 2500, a NovaSeq patterned flow cell, or an amplicon-based technology, you'll get reasonable results. This is thanks to lots of effort that went into making their model (and its heuristics) general. (In essence, yes, using all the training data we've seen up until now.)

The NN picks out artifact patterns automatically, which is impressive, but that makes it very susceptible to changes. Given a large and diverse training corpus, there's no reason why it can't learn general patterns too! My point is that these large, highly validated training sets don't exist, so if you hop to a new (or older) technology, you can't expect DeepVariant to just work. (again, for now)

It's also a contrast to current callers, where you can often look at your new type of data, see "oh, it looks like I'm overcalling at homopolymer runs", and then tweak some parameters to fix the problem. The NN is a total black box and has to be retrained from scratch.

So yeah, I absolutely think that NN-based variant calling will be useful (and probably better!) in the future. I'm just trying to inject some reality into the proceedings here. :)

0

Chris, from my experience, SAMtools / BCFtools mpileup achieves higher sensitivity / specificity than GATK when compared to the gold standard in clinical genetics, Sanger. I imagine that it also beats DeepVariant in this regard. Variant calling need not be so complex / convoluted.
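For anyone wanting to run this kind of comparison themselves, sensitivity and specificity against a Sanger-confirmed set reduce to a few set operations. The sketch below assumes you have the caller's variant positions plus two truth sets (sites Sanger confirmed as variant and sites Sanger confirmed as reference); the function name and the positions are made up for illustration.

```python
# Hypothetical sketch: sensitivity / specificity of a call set against
# Sanger-confirmed sites. All positions here are invented.


def sensitivity_specificity(called, truth_variant, truth_reference):
    """Return (sensitivity, specificity) for a set of called positions.

    called:          positions the caller reported as variant
    truth_variant:   positions confirmed variant by the gold standard
    truth_reference: positions confirmed reference by the gold standard
    """
    tp = len(called & truth_variant)
    fn = len(truth_variant - called)
    fp = len(called & truth_reference)
    tn = len(truth_reference - called)
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


called = {101, 205, 309, 412}
truth_variant = {101, 205, 309, 500}
truth_reference = {412, 600, 700, 800}
sens, spec = sensitivity_specificity(called, truth_variant, truth_reference)
```

Note that specificity is only defined over sites the gold standard actually checked and found to be reference, so the result depends heavily on which sites were sent for Sanger confirmation.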

1

While essentially saying that germline calling is a solved problem is probably a little bit of a stretch, it's absolutely not true for somatic calling. Tumor ploidy and purity come into play, FFPE may be involved, or you might be looking for very low-frequency events, etc. There's a lot of complexity there, and a lot of places where a NN might offer substantial improvements if designed correctly.

0

Yes, I should have stated that mpileup beats everything else (from my experience) where germline variants are concerned. Never benchmarked it for somatic. You are correct: a lot of extra factors go into somatic variant calling.

0

Just want to be clear - the DV team deserves kudos. Variant calling is a hard problem, their method is interesting, and their performance is admirable. If I'm negative about anything, it's the breathless "Google AI has solved genomics!" press coverage, which you can't blame the authors for!

1

I totally agree about the hoopla over "Google AI solved genomics!". At the end of the day it is a product they are releasing, and I'm pretty sure the buzz will exceed what it actually delivers. Having said that, I feel it is worth taking a look at how germline calls are made and improved, although how useful it turns out to be will only become clear with time. For somatic calls, I'm sure they will come up with something soon. I still need to understand the algorithm and how they implemented it. But I'm happy that this kind of work also pushes genomics one step closer to being a research service product, and I support that.

6
6.9 years ago
mdepristo ▴ 70

Hi all,

Glad to see a post here on Biostars about DeepVariant's open source release. If you'd like more information on the accuracy and runtime of DeepVariant across a variety of datasets, have a look at the blog post from DNAnexus about DeepVariant on their internal benchmark datasets.

0

Did you not develop it? Should probably state a disclaimer.

1

Mark isn't trying to hide that fact - consider your post the disclaimer!

4
6.9 years ago

Steven Salzberg has a nice response to the hype:

No, Google's AI Program Can't Build Your Genome Sequence

https://www.forbes.com/sites/stevensalzberg/2017/12/11/no-googles-new-ai-cant-build-your-genome-sequence/#5e35eefb5774
