BWA-mem read mapping depends on chromosome order
1
2
Entering edit mode
9.5 years ago

I'm working with PDX samples, so I expect some mouse contamination in my primarily human DNA sequencing data. To separate reads of mouse and human origin, I generated a new reference genome by concatenating the human and mouse reference genomes. A co-worker's experience lead me to ask whether changing the order would change the number of reads that mapped to the human or mouse chromosomes, and it did. The first time I mapped the reads, I put the human chromosomes first in the concatenated combined genome. When I put the mouse chromosomes first, the number of reads mapped to human chromosomes increased by 120. I mapped about 150 million reads in total using bwa-mem 0.7.6a. I only counted reads with a mapping quality score above 20. When I map the same data twice to the same genome, nothing changes (so this isn't the a general stochastic effect).

Unfortunately I don't have time at the moment to investigate the characteristics of the inconsistently mapped reads. Has anyone else observed changes in mapping when chromosome order changes? Has anyone else looked into the inconsistently mapped reads? Although I'm working with concatenated genomes from different species, I would expect this effect to apply to the many different orders human chromosomes have in various flavors of the hg19 reference genome.

The command line was

bwa mem -t 4 -R [read group stuff] -M ${genome_index} f1.fastq f2.fastq

I de-duped, compressed and sorted after mapping.

Thanks in advance for any input.

[edited to address comments]

software-error sequencing bwa • 4.2k views
ADD COMMENT
0
Entering edit mode

This may depend on the way you handle ambiguously-mapped reads. What was the exact command line?

Also, it might help to put bwa-mem in the title and tag the post with bwa.

ADD REPLY
0
Entering edit mode

Thanks for the suggestions.

ADD REPLY
0
Entering edit mode

I might not understand the exact technicalities of PDX sample, but I don't understand chromosome concatenation..? I worked with bacterial DNA-seq, where bacterial was inside mouse host and there was some mouse DNA contamination. We just mapped against our bacterial genome first. Anything that didn't map, we mapped against the host.. That way you have two bam files rather than parsing out particular chromosome of interest from bam file...(I'm guessing that is the way you are doing this).

I don't think there should be any differences in chromosome order. To my (limited) knowledge bwa mem works on assigning scores to best matches, therefore regardless of the chromosome positions you should get the best match i.e the best reads alignment. And also I believe all chromosomes are fragmented anyway at the indexing step..

ADD REPLY
0
Entering edit mode

We get better results by mapping the PDX data to a concatenated mouse-human genome than mapping mouse-depleted reads to the human genome. With this approach, we get ~4% of PDX data mapping to the mouse chromosomes, and basically 0% of normal human data mapping to mouse chromosomes. When mapping directly to the mouse genome in the subtractive approach, 40% of PDX reads and 40% of normal human reads map to mouse. I assume the optimal strategy for reducing mouse contamination in human samples differs from reducing mouse contamination in bacterial samples because of the greater sequence similarity between human and mouse.

Like you, I would expect bwa-mem to report the best match regardless of the chromosome positions. That's not what I'm observing, even only considering reads with mapping quality scores > 20. I hadn't thought about your point about chromosomes being fragmented during the indexing step. To me that suggests that the differences I'm seeing have to do with the order of the content of the chromosomes, not the chromosomes per se.

To be explicit about the concatenated genome, the steps are just

cat mm10.fa | sed 's/^>/>mm10./' >mm10.hg19.fa # prefix mouse chromosome names with mm10 to distinguish from human
cat hg19.fa >> mm10.hg19.fa
ADD REPLY
0
Entering edit mode

I suggest you give BBSplit a try; it's specifically designed for this problem, and has various settings for how to handle reads that map ambiguously between the two genomes (but not within a genome). So, it should give better results for duplicated regions within a genome.

ADD REPLY
2
Entering edit mode
9.5 years ago
matted 7.8k

Just a guess, but these minor differences could be caused by how bwa reference indexing handles ambiguous bases (N). At least in bwa bwasw, N's in the reference are replaced by random bases in the indexing step. Altering the chromosome order will change the random seed, which will give you a different set of random bases in those stretches (which may map slightly differently, in certain edge cases). You can read more here. I haven't specifically checked if bwa mem does that as well, but it might be worth checking.

ADD COMMENT
0
Entering edit mode

This sounds quite likely. I think sw and mem use identical indexes so the substitution would be the same.

ADD REPLY
0
Entering edit mode

That sounded likely to me too, but I didn't find any instances of different mappings that could be explained by it. I checked out some reads that were mapped to a mouse chromosome only when the mouse chromosomes preceded the human ones in the concatenated genome, and I didn't find any N's within 2kb in either direction of the human or mouse chromosome mapping positions. I assume that for the problem to be dependent on which bases replace Ns, the alignments would be to regions with N in the reference genome. To the point about the random seed being dependent on the chromosome order, could that alone be a sufficient explanation for the alternate mapping positions? The read below has secondary alignments in both cases.

Example alignments of the same read (from a germline human non-xenograft source)

aligned to a mouse chromosome in the mouse-first concatenated genome

D0004:230:H08B1ADXX:1:1111:3605:15641   387     mm10.chr9       9071909 60      88H32M  chr15   63578694        0       AAAAAAAAAAAAAAAAAAAAAACCAATATATC        B<BBBBB<<<<<<077''0'''''07'0''0'        NM:i:1  MD:Z:31A0       AS:i:31 XS:i:0  RG:Z:FC1-NA12878-01_S1_L001_R   SA:Z:chr15,63578710,-,17S103M,60,0;

aligned to a human chromosome in human-first concatenated genome

D0004:230:H08B1ADXX:1:1111:3605:15641   147     chr15   63578710        60      17S103M =       63578694        -119    GATATATTGGTTTTTTTTTTTTTTTTTTTTTTTCTTCTTAATTACATGTGTGGTTTTATTTTAGAAAAGCCAGTGAAGGTTTGAAGAGTATAAACCCAGGTGAGACAGCACCCTCTATGC        '0''0'70'''''0''770<<<<<<BBBBB<B<B<7<<<BFBBBFFFBBBBFFBFIIFBBFBFIFBBFFFBFIFFF<FFBB7BBBF<B0FFBBFFBFB7FFFFBBBB<00BBBFFF<BBB        NM:i:0  MD:Z:103        AS:i:103        XS:i:0  RG:Z:FC1-NA12878-01_S1_L001_R   SA:Z:chr16,59306930,+,87S33M,3,1;
ADD REPLY

Login before adding your answer.

Traffic: 2771 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6