Entering edit mode
8.2 years ago
bitpir
▴
250
Hello, I am trying to calculate the total number of SNPs in my VCF file. When I tried using Picardtool's CollectVariantCallingMetrics, it calculates 174387 SNPs. However, when I tried to use the command line to calculate my SNPs, I get 199320. The command line I used is as follow:
grep -v "#" VARIANT_VCF | wc -l
Does anyone know why there is a difference in 24933 SNPs called in the 2 calculation methods?
Thank you very much for your help!
Yes, my VCF contains both SNPs and indels. Under the CollectVariantCallingMetrics-generated file, theTOTAL SNP = 174387, TOTAL INDEL = 24542,TOTAL_MULTIALLELIC_SNPS = 6659, TOTAL_COMPLEX_INDELS=278. Adding those up, I get 205886 variants. I am using picardtools 2.0.1. I was reading GATK's website and it says that "Many factors affect these counts, including cohort size, relatedness between samples, strictness of filtering, ethnicity of samples, and even algorithm improvement due to updated software." I've also tried using picardtools 2.4.1 but did not get the same number as 2.0.1: TOTAL_SNP=174387, TOTAL_INDEL=24542, TOTAL_MULTIALLELIC_SNPS=646,TOTAL_COMPLEX_INDELS=292. Total variant = 199287.
Looks like TOTAL_SNP and TOTAL_INDEL stayed the same, which makes sense because those are simple events. The difference is in TOTAL_MULTIALLELIC_SNPS and TOTAL_COMPLEX_INDELS. They might be doing some overlapping. For example, a single indel may sometimes be split into multiple events by the variant caller and they might be trying to correct for that.
Since GATK discusses this tool on their website, it's probably a good idea to ask about it on the GATK forum.