Why is a true variant not being called by HaplotypeCaller?
8.7 years ago
AshishKS ▴ 10

I am using HaplotypeCaller to call variants for a targeted sequencing panel of 120 genes. In the gene PMS2 there is a variant at a position with coverage 21, where 5 reads carry the alternate allele (allele fraction 24%). The mapping quality of the reads is mostly 0, I guess because of the very similar pseudogene PMS2CL (mapping was done against hg19 using BWA-MEM). This variant is not called by HaplotypeCaller, but it is a true variant (confirmed by Sanger sequencing). I compared the BAM file with the bamout file, and both are similar. I also tried mapping the reads to a custom reference sequence (restricted to the target regions). When I used that BAM file for variant calling, it called that specific variant (though it also increased the coverage depth and increased the number of variants many fold, which are false positives).

I wonder what the possible explanation for this is. What cutoff criteria is HaplotypeCaller using in this case? Why is the variant not being called in the first place?

Here is the link to a screenshot of the PMS2 variant with coverage 21 (also attached as a file): https://drive.google.com/file/d/0Bwibh75M75p_bGJrNlpyRTVSNHVZRDMzUFB0UDFOV2gyM2Rj/view?usp=sharing (image: Variant at PMS2 gene)

next-gen variant-calling HaplotypeCaller • 3.7k views

I'm not seeing 21 reads in that screenshot. However, the reason is most likely that "the mapping quality of the reads are mostly 0". These reads are probably filtered out before they are used for SNP calling, because we cannot be confident about what they are telling us.
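To make the filtering point concrete, here is a minimal sketch of what a mapping-quality filter does to the usable depth at one site. The depth of 21 and the 5 alternate reads come from the question; the per-read MAPQ values and the bases are invented for illustration, and the threshold of 20 is an assumption (to my knowledge it is the HaplotypeCaller default, but check the documentation for your GATK version). This is not how HaplotypeCaller is implemented, only the arithmetic of dropping low-MAPQ reads.

```python
# Hypothetical pileup at one site: (base_at_site, mapping_quality).
# 21 reads with 5 alt reads matches the question; MAPQ values are made up.
reads = [("A", 0)] * 13 + [("A", 60)] * 3 + [("G", 0)] * 4 + [("G", 60)] * 1

MIN_MAPQ = 20  # assumed threshold, not read from any GATK configuration


def summarize(reads, alt_base, min_mapq=0):
    """Return (depth, alt_count) using only reads with MAPQ >= min_mapq."""
    kept = [base for base, mapq in reads if mapq >= min_mapq]
    alt_count = sum(1 for base in kept if base == alt_base)
    return len(kept), alt_count


raw_depth, raw_alt = summarize(reads, "G")
usable_depth, usable_alt = summarize(reads, "G", MIN_MAPQ)

print(f"before filtering: depth={raw_depth}, alt reads={raw_alt}")
print(f"after filtering:  depth={usable_depth}, alt reads={usable_alt}")
```

With these made-up values the caller would see 4 reads instead of 21, and only 1 of them supports the alternate allele, which is very little evidence to call on.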

I don't know what else to say. If you want to call SNPs, you're going to need more high-quality data - or even better, more individuals known to have the same interesting genotype. Good luck! :)


I'm sorry, I don't understand. What did you want me to read here?

8.7 years ago
lh3 33k

Because "the mapping quality of the reads are mostly 0". An allele balance of 24% is also very bad. You can tune parameters to call the variant anyway, but you are likely to end up with lots of false positives elsewhere. You are hitting the limits of data. You have to choose between low FN and low FP. You can hardly have both.
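As a rough illustration of why a 24% allele balance looks bad for a germline heterozygous call, the sketch below asks how often 5 or fewer alternate reads would be seen out of 21 if each read had a 50% chance of carrying the alternate allele. The simple binomial model is an assumption made here for illustration; it is not the genotype likelihood model HaplotypeCaller uses, and it ignores mapping quality entirely.

```python
from math import comb


def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


depth, alt_reads = 21, 5  # numbers from the question
allele_fraction = alt_reads / depth
p_tail = binom_cdf(alt_reads, depth, 0.5)  # 0.5 = ideal heterozygous balance

print(f"allele fraction: {allele_fraction:.1%}")
print(f"P(<= {alt_reads} alt reads | het, depth {depth}): {p_tail:.4f}")
```

Under this toy model the tail probability is about 1.3%, so the observation is unusual for a clean heterozygous site even before the MAPQ 0 reads are discarded.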


And it also explains why everything improves when a custom reference is used. Remove the pseudogene from consideration and there is no competition for mapping, so your depth of coverage goes up and you get better mapping qualities and perhaps better allele balance. In the context of targeted sequencing this approach may be valid. However, Ashish, depending on your enrichment strategy you may want to confirm that off-target enrichment from the pseudogene isn't expected. If it is amplicon based, what is the probability that the region from the pseudogene might also be amplified?


Yes, it is amplicon based, and there is approximately a 100% probability that the pseudogene is also amplified, because PMS2 and PMS2CL are almost identical. Maybe I should use a custom track that excludes ONLY the pseudogene PMS2CL from the hg19 reference, as I am having trouble only with this gene.


If your primer pairs would definitely amplify both sequences then this isn't a good idea. You'll artificially be placing all reads amplified from the pseudogene on PMS2. Any conclusions you make about variants, genotypes, and frequencies at that point will be wrong.
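A small worked example of why the frequencies go wrong: if reads from both loci are piled onto PMS2, the alternate allele is diluted by the pseudogene copies. The sketch assumes a diploid sample, two copies each of the gene and the pseudogene, equal amplification of all copies, and that every pseudogene read is placed on PMS2. All of these are simplifying assumptions, and the function name is made up for illustration.

```python
def expected_alt_fraction(alt_copies, gene_copies=2, pseudogene_copies=2):
    """Expected alt allele fraction when all copies are amplified equally
    and all resulting reads are aligned to the gene."""
    total_copies = gene_copies + pseudogene_copies
    return alt_copies / total_copies


# Heterozygous variant in the gene, no pseudogene reads mixed in.
clean = expected_alt_fraction(1, pseudogene_copies=0)

# Same variant, but reads from both pseudogene copies land on the gene.
mixed = expected_alt_fraction(1)

print(f"gene only:         {clean:.2f}")
print(f"gene + pseudogene: {mixed:.2f}")
```

Under these assumptions a heterozygous PMS2 variant would be expected near 25% rather than 50%, and a variant that really sits in the pseudogene would appear as a PMS2 variant at a similar fraction, so the genotype cannot be read off the allele balance.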


Yes, I totally agree with you. The custom reference I used previously does the same thing: it places the pseudogene reads on PMS2.
