Hi everyone
I would like to take a second opinion regarding a variant with DP =17. The variant is in a very high GC content area so the low coverage is reasonable.
I do hard filtering because I have few samples for VQSR, I thinks it is okay according to GATK hard filtering, the values are as follows:
AC=1;AF=0.500;AN=2;BaseQRankSum=0.892;ClippingRankSum=0.000;DP=17;ExcessHet=3.0103;FS=10.843;
MLEAC=1;MLEAF=0.500;MQ=60.00;MQRankSum=0.000;QD=2.81;ReadPosRankSum=-1.534;
SOR=2.795 GT:AD:DP:GQ:PL 0/1:14,3:17:76:76,0,506
I am worried because AD is 14,3 which means only 3 reads only support the ALT? is it safe to accept it as a het variant. The variant is very interesting based on our data, so I don't wanna discard without a strong reason.
Any comments, how do you think? with many many thanks
Nice answer! I wonder if the low number of ALT reads could be because ALT reads have a mismatch with the reference. This mismatch could make some reads to map equally well to multiple locations hence their MAPQ is 0 hence they are filtered out somewhere in the analysis.
Maybe useful to inspect the unfiltered bam files to look for the reads mapped at the incriminated locus.
Thanks Gabriel for the nice answer. I did not use a nor b, I will give them a try. Just used GATK pipeline. Thanks dariober, that's another good point I did not think of, I will check other reads . Thanks
good comment. my personal experience with this is if I have to choose between filtering by mapping quality or mappability, pick mappability, it gives greater noise reduction: see figure on page 46 https://media.nature.com/original/nature-assets/nature/journal/v505/n7481/extref/nature12886-s1.pdf
Interesting figure. Never thought about this. So I check if the reads are mappable, if yes, that's a sign of less likely for errors, which means trust the call and trust the variant? Do I understand right?
The reads are not mappable, the genomic region is (or isn't). Mapping quality is defined on a per read basis but mappability is defined per genomic locus. Both aim at reducing the probability that there is a CNV or repeat or a read so badly damaged due to sequencing errors that it could end up in multiple places or a read so short that could have come from multiple genomic loci.
If the genomic region is highly mappable and you have good mapping quality, it likely means that you can discard the "read is mismapped" possibility. now, it is still possible to have sequencing errors or bizarre strand effects due to Illumina even if you are correctly mapped.
Great, got it now. Thanks so much for the clear explanation. Much appreciated Gabriel :)