Hi there
I'm calling SNPs on the human genome using GATK.
For each sample I have both low coverage WGS data and also exome data.
I'm calling SNPs per sample using HaplotypeCaller and producing gVCFs to later combine across samples. Everything works just great and as expected.
But I'm wondering: should I combine the exome + WGS data for each sample, or call on them separately?
For example, should I run CombineGVCFs once on all the WGS gVCFs and once on all the exome gVCFs, to produce one merged file for exome and one for WGS?
Or should I throw everything together and produce one final merged VCF? And if I do this, how should I tell GATK which exome reads are from the same sample as which WGS reads? Read groups?
Thanks.
That is a really great answer, thank you!
"because WGS and exome data have very different error modes." - I had a feeling this would be the problem.
"apply more stringent filters to anything that is not concordant" - That makes a lot of sense. Good idea.
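To make the concordance idea concrete, here is a minimal sketch in plain Python. It assumes each call has already been reduced to a hypothetical (chrom, pos, ref, alt) tuple and that both call sets have been restricted to the exome target regions; the function name and the example calls are made up for illustration, and this is not a GATK tool.

```python
def split_by_concordance(wgs_calls, exome_calls):
    """Split variant keys into those seen in both call sets and those seen in only one.

    Each call is a hypothetical (chrom, pos, ref, alt) tuple.
    """
    wgs, exome = set(wgs_calls), set(exome_calls)
    concordant = wgs & exome   # called in both WGS and exome
    discordant = wgs ^ exome   # called in exactly one of the two
    return concordant, discordant


# Made-up example calls, for illustration only.
wgs_calls = [("chr1", 1000, "A", "G"), ("chr1", 2000, "C", "T")]
exome_calls = [("chr1", 1000, "A", "G"), ("chr2", 500, "G", "A")]

concordant, discordant = split_by_concordance(wgs_calls, exome_calls)
```

The discordant set is then the one that would get the more stringent filters.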
Ok thanks. I am merging the exome data and the WGS data across samples separately.
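For anyone following along, this is roughly how the two separate merges could be scripted. It is only a sketch: the helper name and all file names are placeholders I made up, and it only builds the `gatk CombineGVCFs` command lines (using `-R`, `--variant` and `-O`) rather than running them.

```python
def combine_gvcfs_cmd(reference, gvcfs, output):
    """Build a `gatk CombineGVCFs` command line as a list of arguments."""
    cmd = ["gatk", "CombineGVCFs", "-R", reference]
    for gvcf in gvcfs:
        cmd += ["--variant", gvcf]  # one --variant per per-sample gVCF
    cmd += ["-O", output]
    return cmd


# Placeholder file names; substitute your own reference and per-sample gVCFs.
wgs_cmd = combine_gvcfs_cmd(
    "ref.fasta",
    ["s1.wgs.g.vcf.gz", "s2.wgs.g.vcf.gz"],
    "wgs.combined.g.vcf.gz",
)
exome_cmd = combine_gvcfs_cmd(
    "ref.fasta",
    ["s1.exome.g.vcf.gz", "s2.exome.g.vcf.gz"],
    "exome.combined.g.vcf.gz",
)

print(" ".join(wgs_cmd))
print(" ".join(exome_cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)` once the paths point at real files.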
By the way, do you have any feel for the number of samples needed to apply Variant Quality Score Recalibration (VQSR) to the WGS data? I found a couple of sources in the GATK documentation saying at least 30 samples for exomes (and to pad out with 1000 Genomes data if you have fewer). What I've read seems to suggest that for WGS even a single sample is OK, but is there a practical minimum?
You're welcome!
For filtering, those are indeed the standard recommendations. VQSR can usually run successfully even on just one WGS sample, though you may get better results with the newer method, GATK CNN, which uses deep learning and does a better job, especially on indels. For anything above one WGS sample you should be fine with VQSR. The reason exomes have the 30+ sample recommendation is that there are so many fewer variants in a single sample's exome that you need a lot of exomes to accumulate enough variants to build a stable model with VQSR's internal algorithm.
Good luck!
Don't add answers when you mean to reply to someone's post. Only add answers if you're answering the top level post.