Question

Looking for DeepVariant training data on yeast strains

23 months ago

Michael 55k

I would like to try DeepVariant to call variants in Illumina WGS data from some yeast strains, using the S. cerevisiae R64 reference genome. For that, I need training data to use with dv_make_examples.py. If I understand correctly, these should ideally come from the same species (Saccharomyces cerevisiae). If I am not mistaken, we also need VCF files with "true" (validated) variants and the corresponding sequencing data, which I can get from SRA. I was unable to find such variant data for download at SGD. Or does the species not matter, so that I could simply use human data to train the model?
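To make the data requirement concrete, here is a minimal sketch of how a training-mode example-generation command could be assembled. It assumes the upstream DeepVariant `make_examples` flags (`--mode training`, `--truth_variants`, `--confident_regions`); whether the `dv_make_examples.py` wrapper forwards these flags is an assumption that should be checked. All file names and the helper `build_make_examples_cmd` are hypothetical placeholders.

```python
import shlex


def build_make_examples_cmd(ref, reads, truth_vcf, confident_bed, out_examples,
                            executable="make_examples"):
    """Assemble a DeepVariant make_examples command line for training mode.

    Flag names follow upstream make_examples; the executable name and
    all paths are placeholders, not values from the original post.
    """
    return [
        executable,
        "--mode", "training",
        "--ref", ref,                          # reference FASTA (e.g. R64)
        "--reads", reads,                      # aligned reads (BAM)
        "--truth_variants", truth_vcf,         # validated "true" variants
        "--confident_regions", confident_bed,  # regions where the truth set is trusted
        "--examples", out_examples,            # output examples file
    ]


# Hypothetical file names, for illustration only.
cmd = build_make_examples_cmd(
    ref="R64.fa",
    reads="strain.bam",
    truth_vcf="truth.vcf.gz",
    confident_bed="confident.bed",
    out_examples="training.examples.tfrecord.gz",
)

# Print a shell-safe version of the command instead of running it.
print(shlex.join(cmd))
```

The sketch only prints the command. The point is that both a truth VCF and a set of confident regions are required, and these are exactly the inputs that are hard to find for yeast.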

data training DeepVariant • 1.1k views

You could download the VCF files published with https://www.nature.com/articles/s41586-018-0030-5; they are available here: http://1002genomes.u-strasbg.fr/files/

23 months ago by GenoMax 147k

Yes, I have seen this paper. My concern is that, by training on those calls, I would simply replicate the GATK pipeline the authors used. As far as I understand, some filtering was involved, but there was no manual curation or validation.

23 months ago by Michael 55k

score 1 · Answer 1 · 2022-12-29

I have come to the conclusion that using DeepVariant does not make sense in my case, so I will not pursue it. Instead, we will use GATK4 in a similar fashion to Peter et al. 2018, who used an older version. Deep learning methods make a lot of sense when curated, manually validated training data are available, which doesn't seem to be the case here. The gold-standard truth set matters; for human data it was obtained by generations of scientists using more traditional methods of de-novo variant calling and labor-intensive validation. The sheer amount of training data has allowed Google to outperform sequence-based variant callers in competitions, but one should not forget that this success would not have been possible without de-novo variant callers. So, for other organisms where training data are sparse, using DeepVariant makes no sense. (Alphabet folks: feel free to provide evidence to the contrary.)