I am sequencing human exome data and looking for clinically relevant SNPs. I am using the standard GATK workflow, applying a hard filter, then annotating with SnpEff and looking for ClinVar SNPs.
Overall, about 1 in 25,000 exome bases is reported as a SNP at the end of the GATK workflow. Additionally, a single human exome yields about 450 ClinVar SNPs that are annotated with known disease states.
This seems quite high to me. Does anyone have a good idea of what frequency of SNPs I should expect for a normal, healthy human exome? I assume I have lots of false positives due to my crude hard filtering method, but these are SNPs that survived the entire GATK workflow, including recalibration, etc., so I thought they would be higher quality.
Thanks for any perspective.
That depends on multiple factors, such as sample origin, exome kit, and the parameters used for alignment and variant calling. Usually, after filtering, annotating the final VCF with SnpEff or VEP, and restricting your variant list to the exonic regions, you should get between 20,000 and 40,000 variants.
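To compare a per-base rate with a per-exome count, it can help to convert one into the other. The sketch below does that arithmetic in Python. The target sizes of 30 Mb and 60 Mb are assumptions for illustration only (capture kits differ, and the thread does not state which one was used), and the function names are made up for this example.

```python
def expected_variant_count(target_bases: int, bases_per_variant: float) -> float:
    """Number of variants implied by a rate of 1 variant per `bases_per_variant` bases."""
    return target_bases / bases_per_variant


def bases_per_variant(target_bases: int, variant_count: int) -> float:
    """Average spacing (in bases) between variants for a given count."""
    return target_bases / variant_count


# ASSUMPTION: illustrative capture target sizes, not taken from the thread.
ASSUMED_TARGET_SIZES = {"30 Mb target": 30_000_000, "60 Mb target": 60_000_000}

for label, size in ASSUMED_TARGET_SIZES.items():
    # Rate quoted in the question: 1 SNP per 25,000 bases.
    implied = expected_variant_count(size, 25_000)
    # Range quoted in the answer: 20,000 to 40,000 variants per exome.
    spacing_low = bases_per_variant(size, 40_000)
    spacing_high = bases_per_variant(size, 20_000)
    print(f"{label}: 1 in 25,000 implies ~{implied:,.0f} variants; "
          f"20-40k variants implies 1 per {spacing_low:,.0f} to {spacing_high:,.0f} bases")
```

Under these assumed target sizes, a rate of 1 in 25,000 would correspond to only a few thousand calls at most, which is below the 20,000 to 40,000 range, so it may be worth double-checking which denominator was used when computing the rate.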