I am seeing strange behavior when comparing alignments generated by bwa mem 0.7.12 against two different reference genomes: hg19 and hs37d5 (basically hg19 plus additional decoy sequences). The data are DNA-seq. When using hg19, there are very few alignments to the MHC region on chromosome 6, and even fewer of these have nonzero mapping quality. When using hs37d5, there are dramatically more alignments to the region, and most of them have high mapping quality. I have not observed this phenomenon anywhere else I've looked in the genome, and the behavior is robust to multiple different choices of BWA parameters. Can anyone explain why including the additional 35 Mb of decoy sequences in hs37d5 would so drastically improve the number and quality of alignments to this region of chr6?
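For anyone who wants to reproduce the comparison, here is a minimal sketch of how the two numbers (alignments in the region, and how many of them have nonzero mapping quality) could be tallied from SAM-format text. The function name `summarize_region` and the region coordinates in the comments are my own assumptions for illustration, not something taken from the pipeline described above. Note that hg19 names the contig `chr6` while hs37d5 names it `6`, so the contig argument has to change between the two runs.

```python
def summarize_region(sam_lines, contig, start, end):
    """Count mapped alignments whose leftmost position (SAM POS, 1-based)
    falls within [start, end] on `contig`, and how many have MAPQ > 0.

    Returns (total_alignments, alignments_with_nonzero_mapq).
    """
    total = 0
    nonzero_mapq = 0
    for line in sam_lines:
        if line.startswith("@"):        # skip SAM header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:            # not a complete SAM record
            continue
        flag = int(fields[1])
        if flag & 0x4:                  # 0x4 = segment unmapped
            continue
        rname, pos, mapq = fields[2], int(fields[3]), int(fields[4])
        if rname == contig and start <= pos <= end:
            total += 1
            if mapq > 0:
                nonzero_mapq += 1
    return total, nonzero_mapq


# Hypothetical usage: the coordinates below are an assumed extent for the
# MHC region on GRCh37/hg19 and should be checked before relying on them.
# with open("hg19_alignments.sam") as fh:      # hypothetical file name
#     print(summarize_region(fh, "chr6", 28477797, 33448354))
```

Running it once per reference (with `chr6` for hg19 and `6` for hs37d5) gives a pair of counts that can be compared directly.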
I don't know why the decoy is doing this, but I wonder: is there something special about chromosome 6?
Have you loaded the region in IGV? Are the reads all mapping to a very small region? I recently saw similarly strange behaviour with reads from both ATAC-seq and ChIP-seq data mapping to a small area of chr6. The reads were also highly enriched for very long motifs (20+ bases). I suspect these regions were missed by repeat masking because they occurred only within short stretches bridging very large repeat-masked regions.
Thanks for this idea. I've looked at the alignments in IGV. They map to several punctate peaks, leaving most of the region uncovered, whereas we would expect coverage of the entire region to be fairly even. I'm guessing we are seeing alignments to regions that are easier to sequence, but I still don't know why these alignments disappear when using hg19.
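To put a number on "punctate peaks" rather than eyeballing IGV, one rough option is to bin alignment start positions into fixed windows and see what fraction of windows are empty. This is only a sketch: the function names and the default 10 kb bin size are arbitrary choices of mine, and it counts alignment starts rather than true per-base coverage.

```python
from collections import Counter


def bin_alignment_starts(positions, region_start, region_end, bin_size=10000):
    """Return a list of per-bin counts of alignment start positions
    falling within [region_start, region_end]."""
    counts = Counter()
    for pos in positions:
        if region_start <= pos <= region_end:
            counts[(pos - region_start) // bin_size] += 1
    n_bins = (region_end - region_start) // bin_size + 1
    return [counts.get(i, 0) for i in range(n_bins)]


def fraction_empty(bins):
    """Fraction of bins with no alignment starts at all."""
    return sum(1 for c in bins if c == 0) / len(bins)
```

With fairly even coverage the empty fraction should be close to zero; a few isolated peaks would leave most bins empty.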