Hello,
I am having difficulties understanding the behaviour of BWA when mapping reads to a contig that contains a duplication.
In a viral metagenomics context, I am using BWA to map reads back to my assembly. Some of my contigs contain duplications, either biological or artificial. Ideally, I would like BWA to pick one of the possible positions for each read at random and designate this as the only (primary) alignment.
I couldn't find clear information about how BWA handles the presence of duplications, so I made a test: I took a 855 bp contig and 50 read pairs that I know map to it. Then I copied the sequence of the contig and pasted it directly after the original to get a 1710 bp sequence with two identical halves. When I then map my 50 read pairs against this construct, I see that
a) the distribution of the alignments is not random (the outcome is reproducible and the fist half gets much more alignments than the second half),
b) each read is mapped exactly once and therefore none of them gets a 256 ("not primary alignment") or 2048 ("supplementary alignment") flag,
c) BWA's -M option doesn't change the output at all.
I am a bit puzzled by this behaviour and would appreciate any help to explain it and to achieve the goal of picking an alignment at random if multiple ones are feasable.
Thanks in advance,
Nikolas
Thanks for your answer! Indeed I forgot to mention that I am using
bwa mem
version 0.7.17. Thank you for pointing out the-a
option. This indeed leads to all reads mapping twice (once in each part), in the manner I would expect. One of the alignments gets a 256 flag. But when I filter the reads with the 256 flag out, I get the same result as before. The distribution might be uneven for stochastic reasons but I cant test that because the outcome is absolutely reproducible, not randomly generated in reach run.I cannot share the data, because it's not mine but I can share a screenshot of the alignments from IGV. On the top you see the outcome with the
-a
option. At the bottom without it.So I repeated the test with a different contig (similar length as the first one), but with 900 read pairs mapping as opposed to 50. The distribution of alignments between the two identical halves of the sequence looks much more similar now (see screenshot, bottom). I believe now that the uneven distribution I saw in the first test was indeed a stochastic effect and BWA is chosing the primary alignment semi-randomly if two positions are equally good. "Semi" because the outcome is reproducible given identical inputs (which is a good thing), but random enough for downstream analyses.