I would like to try DeepVariant to call variants in Illumina WGS data from some yeast strains, using the S. cerevisiae R64 reference genome. For that, I need training data to use with dv_make_examples.py. If I understand correctly, these should ideally come from the same species (Saccharomyces cerevisiae). If I am not mistaken, training requires VCF files with "true" (validated) variants plus the corresponding sequencing data (which I can get from SRA). I was unable to find such variant data for download in SGD. Or does the species not matter, so that I could simply use human data to train the model?
You could use the VCF files from this paper: https://www.nature.com/articles/s41586-018-0030-5. They are available for download here: http://1002genomes.u-strasbg.fr/files/
Yes, I have seen this paper. I was just wondering whether training on those calls would simply replicate the GATK pipeline the authors used. As far as I understand, there was some filtering involved, but no manual curation or independent validation.
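To make "some filtering" concrete, here is a minimal sketch of what threshold-based filtering of a VCF looks like. It only illustrates the general idea: the function name `keep_pass_records`, the `min_qual` cutoff of 30, and the toy records are my own assumptions, not the filters used in the paper. Records that survive a filter like this are still just calls from the original pipeline, not independently validated variants.

```python
import io


def keep_pass_records(vcf_lines, min_qual=30.0):
    """Yield header lines plus records with FILTER == PASS and QUAL >= min_qual.

    Hypothetical helper for illustration; the cutoff is an assumed value.
    """
    for line in vcf_lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("#"):
            # Keep meta-information and the column header unchanged.
            yield line
            continue
        fields = line.split("\t")
        qual, filt = fields[5], fields[6]  # VCF columns 6 (QUAL) and 7 (FILTER)
        if filt != "PASS":
            continue
        if qual == "." or float(qual) < min_qual:
            continue
        yield line


# Toy VCF content (made-up records, not real yeast data).
example_vcf = "\n".join([
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chrI\t100\t.\tA\tG\t50\tPASS\t.",
    "chrI\t200\t.\tC\tT\t10\tPASS\t.",
    "chrI\t300\t.\tG\tA\t60\tLowQual\t.",
])

kept = list(keep_pass_records(io.StringIO(example_vcf)))
for record in kept:
    print(record)
```

In this toy input, only the record at position 100 passes both conditions. The one at position 200 fails the quality cutoff, and the one at position 300 carries a non-PASS filter flag.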