How does BWA handle duplications?
2
0
Entering edit mode
2.8 years ago
Nikolas • 0

Hello,

I am having difficulties understanding the behaviour of BWA when mapping reads to a contig that contains a duplication.

In a viral metagenomics context, I am using BWA to map reads back to my assembly. Some of my contigs contain duplications, either biological or artificial. Ideally, I would like BWA to pick one of the possible positions for each read at random and designate this as the only (primary) alignment.

I couldn't find clear information about how BWA handles the presence of duplications, so I made a test: I took a 855 bp contig and 50 read pairs that I know map to it. Then I copied the sequence of the contig and pasted it directly after the original to get a 1710 bp sequence with two identical halves. When I then map my 50 read pairs against this construct, I see that

a) the distribution of the alignments is not random (the outcome is reproducible and the fist half gets much more alignments than the second half),

b) each read is mapped exactly once and therefore none of them gets a 256 ("not primary alignment") or 2048 ("supplementary alignment") flag,

c) BWA's -M option doesn't change the output at all.

I am a bit puzzled by this behaviour and would appreciate any help to explain it and to achieve the goal of picking an alignment at random if multiple ones are feasable.

Thanks in advance,

Nikolas

duplication bwa • 1.7k views
ADD COMMENT
0
Entering edit mode
2.8 years ago

I think something like this has come before on biostars...

a) the distribution of the alignments is not random (the outcome is reproducible and the fist half gets much more alignments than the second half),

Can you share on e.g. dropbox some data to reproduce this? I guess you are using bwa mem (better to say which version though).

b) each read is mapped exactly once and therefore none of them gets a 256 ("not primary alignment") or 2048 ("supplementary alignment") flag,

Yes, by default bwa returns one alignment per read or read segment. Also, 2048 should refer to the case where a read is split in two or more chunks and each chunk is aligned at different positions. The smaller (I think) chunks get flagged as 2048 with -M option. Maybe what you want is the option -a (output all alignments for SE or unpaired PE) in which case additional alignments are flagged with 256?

ADD COMMENT
0
Entering edit mode

enter image description here Thanks for your answer! Indeed I forgot to mention that I am using bwa mem version 0.7.17. Thank you for pointing out the -a option. This indeed leads to all reads mapping twice (once in each part), in the manner I would expect. One of the alignments gets a 256 flag. But when I filter the reads with the 256 flag out, I get the same result as before. The distribution might be uneven for stochastic reasons but I cant test that because the outcome is absolutely reproducible, not randomly generated in reach run.

I cannot share the data, because it's not mine but I can share a screenshot of the alignments from IGV. On the top you see the outcome with the -a option. At the bottom without it.

ADD REPLY
0
Entering edit mode

So I repeated the test with a different contig (similar length as the first one), but with 900 read pairs mapping as opposed to 50. The distribution of alignments between the two identical halves of the sequence looks much more similar now (see screenshot, bottom). I believe now that the uneven distribution I saw in the first test was indeed a stochastic effect and BWA is chosing the primary alignment semi-randomly if two positions are equally good. "Semi" because the outcome is reproducible given identical inputs (which is a good thing), but random enough for downstream analyses. enter image description here

ADD REPLY
0
Entering edit mode
2.8 years ago

b) is expected behavior for bwa, I believe. I'm not sure about a), but I think a fairer test might be to make two separate sequences, instead of one doubled sequence, and see how many align to each.

ADD COMMENT
0
Entering edit mode

Thanks for your answer! I get the same results in both cases, when I use a FASTA file with two separate, identical sequences or one sequence that is duplicated in itself. However, the latter case is closer to the situation I expect in my data.

ADD REPLY
0
Entering edit mode

I believe bwa when aligning makes a large concatenated sequence, so you might have some imbalance if reads can align to the first end "hanging over" onto the second half, but bwa might refuse to align reads that "hang off" the very end.

ADD REPLY

Login before adding your answer.

Traffic: 1400 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6