Question

Mapping Quality 0 Reads And Variant Calling

6

Entering edit mode

11.1 years ago

huskerjeff492 ▴ 160

This is a continuation to another post (bwa mapping qualities of 0) (I need to thank Istvan Albert for his generous help, Thank you)-- but I think different enough to warrant a new question.

I would like to know how others have handled this situation.

I have captured some genes and one of them happens to be COL11A2. After looking at the sam records (Thanks Istvan) I found these reads map to chr6 but also

(chr6_dbb_hap3,+4426630,151M,0;chr6_cox_hap2,+4589159,151M,0;chr6_ssto_hap7,+4625489,151M,1;chr6_qbl_hap6,+4377546,151M,1;chr6_mann_hap4,+4602639,151M,1;chr6_mcf_hap5,+4619088,151M,1;)

The multiple mappings mean those reads have a mapping score of 0. I'm very sure GATK ignores reads with a mapping score of 0 -- mpileup does not exclude mapping 0 reads (well - guess it depends on your -q parameter). The mapping quality scores also go into the variant call score but I am not sure of the details on these calculations.

BTW I also checked some exomes for this region, I would imagine almost no one that uses hg19 has ever called a variant for this gene.

I want to call variants for this gene but the mapping quality score of 0 will affect my outcome.

I could remove the chr6_?? hapmap, etc fragment from hg19 but that means I am not using a std reference. I could use the ReassignMappingQualityFilter in GATK but that would reassign all 0 to whatever value.

Neither solution is perfect does someone have a better idea?

Thanks

bwa gatk mpileup • 8.0k views

ADD COMMENT • link updated 11.1 years ago by Jorge Amigo 14k • written 11.1 years ago by huskerjeff492 ▴ 160

score 6 · Answer 1 · 2013-10-17

for mapping purposes you may include all reference genome contigs, but haplotype alternatives should be avoided. the best reference I've found so far to support this is Heng Li's explanation
reads with mq=0 should be removed before any variant calling. maybe you don't want to go as high as GATK's 20, but a mapping quality threshold should be established previously if you would like to improve your variant calling results. mq=0 indicates that those reads may not be even there, so if you detect any variants on them it would be the same as finding variants which you wouldn't be able to position through the genome. therefore, useless. this has been taken into account since the very NGS beginning (in fact a useful reference to support this would be the almost classic Heng Li's MAQ paper).

score 0 · Answer 2 · 2013-10-16

This is an interesting case, with this new information about the specific coordinates. The reads in that region apparently lie in the complicated MHC region of the human genome. hg19 has several haplotypes of the MHC available, and your reads are mapping to many of them due to this redundancy.

In this particular case, particularly if you are focusing on a single gene in the MHC, I think it's fine to exclude the secondary haplotypes and remap. A better option, if you can manage it, is to determine the correct haplotype for your particular sample and use only that one in the mapping. It may not matter, though, for this particular gene.

You can see more discussion here:

Additional Data In Human Genome (Hg18 / Hg19) Assembly ?

What Is The Difference Between Hg18 And Hg19?

The second link has a particularly relevant snippet from Obi Griffith:

"This is important because if you are doing an alignment against hg19 and your sequences come from one of these regions with alternate haplotype assemblies you can get a new kind of (apparently) ambiguous alignment where your sequence aligns equally well to chr6 but also chr6_apd_hap1, chr6_cox_hap2, etc. This may cause problems in existing scripts that are not aware of this issue."