Hi all,
I hope you're well. I have one question regarding variant calling in NGS data.
We have sequenced a number of human germline exomes, captured with Agilent SureSelect 50mb All Exon kit, in the Illumina GAII platform, paired-end, 75bp reads. We have aligned with BWA and called variants with Samtools mpileup.
I was looking at one particular variant, located at GRCh37 7:124499165. This variant is called heterozygously with OK SNP quality (QUAL 20-41) in 6/41 exomes, and with low qualities in a further 16 exomes. Sequencing depths for the exomes where variant has been called with OK quality ranges from 61 to 139 HQ reads, however, although the reference allele (A) is supported by reads in both directions (in the majority of cases 2-4 fold more reads in the forward direction though), the alternative allele is supported, in all cases, by only reads in the reverse direction (range 17 to 45 per exome). Of course, these variants will not pass the strand-bias test, so will be filtered out during analysis.
However, this variant is present in the exac dataset with a high AF: http://exac.broadinstitute.org/variant/7-124499165-A-C, which makes me think that maybe it's not being dropped by exome post-filtering. I believe this variant might be a systematic mapping/calling error because 1. the reads supporting the variant are in only one direction in all cases (at least in one batch of data), suggesting a sequence-specific error, and 2. because other people seem to have observed this variant, which apparently passed all their filters, and attempted Sequenom validation but failed (Bass AJ et al. Genomic sequencing of colorectal adenocarcinomas identifies a recurrent VTI1A-TCF7L2 fusion. Nat Genet 43, 964–968 (2011)).
However, the question becomes, is there any way to tell a (possibly) systematic error of this kind from a real variant? The reason I'm interested in this question is because we are doing a new experiment in which we are sequencing over this region with Fluidigm technology, in which we get reads in only one direction so we lose the ability to do strand-bias testing.
Thank you very much,
Daniela
If you suspect something is a mapping/calling artifact, I suggest you try a different aligner (such as BBMap) and possibly a different variant-caller (such as GATK) to see if the variation remains. You could even try a different reference (say, HG20).
Thanks very much for your suggestion Brian. I used a different caller (GATK HaplotypeCaller) and indeed it seems the variant is not being called. However I'm a bit puzzled as to why it is so common in the exac data if I suspect loads of those exomes used GATK. Maybe they used GATK UnifiedGenotyper, which does call it. I will ask over there. Thanks again! :)