12.7 years ago
Leandro Batista
I am calling SNPs from mouse whole-genome sequences using GATK.
Right now I'm stuck on Variant Quality Score Recalibration (VQSR) because I don't know what to use as a training set for mouse SNPs. Every example I see concerns human genome analyses, where people generally use both HapMap and Omni data.
Is there someone doing the same thing in mouse who might know a good training set for this species?
Thanks
To overcome that, I was thinking of repeating the SNP calling on all the strains I'm using for comparison. That way I would follow exactly the same steps and parameters for each dataset.
I've already tried those VCFs, but they seem to be in an older version of the format, VCF3, that is no longer supported by GATK. At least that is what the error message says. I also tried to convert them using vcftools, but they are too big and it takes too long.
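If the size is the main obstacle, one option is to shrink the file before converting it. Below is a minimal Python sketch, not a format converter: it streams the file line by line and keeps only the header lines and the records that look like simple SNPs. It assumes the usual tab-separated layout with REF in column 4 and ALT in column 5, and the function names (`open_text`, `is_snp_record`, `filter_snps`) are made up for illustration.

```python
import gzip

BASES = {"A", "C", "G", "T"}


def open_text(path):
    """Open a plain or gzip-compressed text file for reading."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


def is_snp_record(line):
    """Return True if REF and every ALT allele are single bases (assumed layout)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 5:
        return False
    ref = fields[3].upper()
    alts = fields[4].upper().split(",")
    return ref in BASES and all(alt in BASES for alt in alts)


def filter_snps(lines):
    """Yield header lines and SNP records, one at a time, without loading the file."""
    for line in lines:
        if line.startswith("#") or is_snp_record(line):
            yield line


def write_snps_only(in_path, out_path):
    """Stream in_path to out_path, dropping every non-SNP record."""
    with open_text(in_path) as src, open(out_path, "w") as dst:
        for line in filter_snps(src):
            dst.write(line)
```

Because it never holds more than one line in memory, this should cope with very large files, although it still has to read the whole file once. The reduced file would still need to be converted to a VCF version that GATK accepts.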
Welcome to the world of big data. Can I ask what your end goal is? Perhaps there is another way. I have been working on calling variants in mouse tumors. There is a fair amount of mouse sequence data in the Short Read Archive, but if you don't want to convert the VCF, you most certainly won't want to deal with raw reads.
There's no problem converting this file; I just wanted to know whether there was another way or other files. I just received the whole-exome sequence for one strain, and our goal is to call variants, especially SNPs, and compare them to some other strains that were completely sequenced in the Sanger mouse project, as you mentioned. The file is actually being converted right now.
Good deal. One thing I would watch out for is pipeline discrepancies. I found that you can get a lot of false positives if you don't have control over the backgrounds (what you're comparing your exome to).
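A quick way to get a feel for how far two call sets disagree is to compare them site by site. Here is a rough Python sketch under simple assumptions: both files use the standard tab-separated columns (CHROM, POS, ID, REF, ALT), both use the same chromosome naming, and a variant is identified by chromosome, position, reference allele and alternate allele. The helper names `site_keys` and `compare_calls` are hypothetical.

```python
def site_keys(lines):
    """Collect (chrom, pos, ref, alt) keys from VCF-like lines, one key per ALT allele."""
    keys = set()
    for line in lines:
        if line.startswith("#"):
            continue  # skip meta and header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            continue  # skip malformed or empty lines
        chrom, pos, _id, ref, alt = fields[:5]
        for allele in alt.split(","):
            keys.add((chrom, int(pos), ref.upper(), allele.upper()))
    return keys


def compare_calls(sample_lines, background_lines):
    """Split sites into shared, sample-only and background-only sets."""
    sample = site_keys(sample_lines)
    background = site_keys(background_lines)
    return {
        "shared": sample & background,
        "sample_only": sample - background,
        "background_only": background - sample,
    }


if __name__ == "__main__":
    exome = ["1\t100\t.\tA\tG\n", "1\t250\t.\tC\tT\n"]
    strain = ["1\t100\t.\tA\tG\n", "1\t400\t.\tG\tA\n"]
    result = compare_calls(exome, strain)
    for name, sites in result.items():
        print(name, len(sites))
```

A large sample-only set is not necessarily real biology. Differences in chromosome naming (for example "chr1" versus "1"), reference build, or filtering between the two pipelines would also show up there, which is one reason to run every strain through the same steps.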
Still on the question of the training set: do you use the Sanger VCFs as truth sites or just for training?