Hello everyone, I downloaded one NA12878 WES run, SRR098401, from the NCBI FTP server with sratoolkit. I aligned it to hg19 with BWA-MEM and followed the GATK best practices to obtain a VCF that contains only SNVs. I did not apply any filtering.
After obtaining the VCF, I compared it with the NISTv3.3.2 data (ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/NA12878_HG001/NISTv3.3.2/GRCh37/).
I am not sure why, but the numbers look really strange to me. The total number of SNVs in SRR098401 is 1,637,352, and the number of true positives (calls matching NISTv3.3.2) is 1,323,681.
Isn't this too high? I think the total number of variants (1,637,352) is very high for a WES experiment.
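For reference, the two counts above imply the following false-positive count and precision. This is only arithmetic on the quoted numbers, and it treats every call without a NISTv3.3.2 match as a false positive, as the thread does; the variable names are illustrative.

```python
# Counts quoted in the post above.
total_calls = 1_637_352      # all SNVs called in SRR098401
true_positives = 1_323_681   # calls that match NISTv3.3.2

# Calls with no match in the truth set (assumed here to be false positives).
false_positives = total_calls - true_positives

# Precision: fraction of all calls that are confirmed by the truth set.
precision = true_positives / total_calls

print(f"false positives: {false_positives:,}")
print(f"precision: {precision:.3f}")
```

So roughly four out of five unfiltered calls match the truth set.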
Also, I ran the same steps on 20 NA12878 WES runs, and the total number of unique SNVs is 3,707,040. At first I thought these were WGS experiments, but the metadata says they are WES. Should I be using an additional file, such as a "manifest" file?
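A minimal sketch of what using such a file could look like, assuming the "manifest" is a BED-style list of capture target intervals (the thread does not confirm this). It keeps only SNV positions that fall inside a target. The function names `load_targets` and `on_target` and the example intervals are hypothetical, and BED coordinates are taken as 0-based half-open while VCF positions are 1-based.

```python
from bisect import bisect_right
from collections import defaultdict


def load_targets(bed_lines):
    """Build sorted, merged interval lists per chromosome from BED-style lines."""
    raw = defaultdict(list)
    for line in bed_lines:
        # Skip blank lines and BED header/annotation lines.
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        chrom, start, end = line.split()[:3]
        raw[chrom].append((int(start), int(end)))

    merged = {}
    for chrom, intervals in raw.items():
        intervals.sort()
        out = [intervals[0]]
        for start, end in intervals[1:]:
            if start <= out[-1][1]:
                # Overlapping or touching: extend the previous interval.
                out[-1] = (out[-1][0], max(out[-1][1], end))
            else:
                out.append((start, end))
        merged[chrom] = out
    return merged


def on_target(targets, chrom, pos):
    """Return True if the 1-based VCF position lies inside a target interval."""
    intervals = targets.get(chrom, [])
    pos0 = pos - 1  # convert to 0-based to match BED coordinates
    # Index of the last interval whose start is <= pos0.
    i = bisect_right(intervals, (pos0, float("inf"))) - 1
    return i >= 0 and intervals[i][0] <= pos0 < intervals[i][1]


# Made-up intervals for illustration only.
targets = load_targets(["chr1\t100\t200", "chr1\t150\t300", "chr2\t10\t20"])
print(on_target(targets, "chr1", 101))  # inside the merged chr1 interval
print(on_target(targets, "chr1", 301))  # just past the end of it
```

Counting SNVs before and after such a restriction would show how many of the calls lie outside the targeted regions.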
Thank you in advance
What seems to be the problem? True positives were 1.3 million according to NIST.
Hello, as far as I know the NIST gold standard VCF is derived from both WES and WGS experiments. Our advisor said that 1.3 million SNVs (1.6 million including false positives) are too many for a single WES experiment. Across all 20 WES runs of the same individual, NA12878, the total number of SNVs (including false positives) is over 3 million.
Thank you
You haven't said what the problem is. An advisor said that's too many? It depends on the purpose. For some purposes you may need only a top-ten list: if you were diagnosing a rare disease using WES, you would want to find the SNVs that are not common in the general population, and perhaps filter to keep only the most deleterious protein modifications. For identifying an individual, you might want a few hundred SNVs of the highest quality, or well-characterized SNVs from training datasets. As you've stated the situation here, running just one sample and finding 1.6 million SNVs is a good thing. If you need fewer, you could apply a quality filter and keep about 50% of them.
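One way to read the 50% suggestion is to rank the calls by the VCF QUAL column and keep the higher-scoring half. The sketch below does that on plain VCF text lines. It is an illustration under that assumption, not a recommended filtering protocol; QUAL is used only as the simplest stand-in for "quality", and the function name `top_half_by_qual` and the example records are made up.

```python
def top_half_by_qual(vcf_lines):
    """Return header lines plus the half of the records with the highest QUAL."""
    header, records = [], []
    for line in vcf_lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("#"):
            header.append(line)
            continue
        qual = line.split("\t")[5]  # QUAL is the sixth VCF column
        if qual == ".":
            continue  # records with missing QUAL are dropped in this sketch
        records.append((float(qual), line))

    # Rank record indices by QUAL, highest first, and keep the top half.
    ranked = sorted(range(len(records)), key=lambda i: records[i][0], reverse=True)
    kept = set(ranked[: len(records) // 2])

    # Emit the surviving records in their original file order.
    return header + [rec for i, (_, rec) in enumerate(records) if i in kept]


# Made-up records for illustration only.
example = [
    "##fileformat=VCFv4.2",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
    "chr1\t100\t.\tA\tG\t50.0\t.\t.",
    "chr1\t200\t.\tC\tT\t900.5\t.\t.",
    "chr1\t300\t.\tG\tA\t12.3\t.\t.",
    "chr1\t400\t.\tT\tC\t300\t.\t.",
]
filtered = top_half_by_qual(example)
for line in filtered:
    print(line)
```

Comparing the filtered set against NISTv3.3.2 again would show whether the discarded half was mostly unmatched calls.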