I would like to know how BWA mem handles repetitive hits.
I know that bwa aln/samse- given a read that maps to multiple regions- will randomly select one region. Each query will be listed once in the sam file, and alternative mapping locations for that read will be listed in XA if the number of hits is less than INT hits.
But what about BWA mem? I keep reading that BWA mem differs from bwa aln in that it will split queries, but what about reads that aren't split but still map to multiple locations? Does mem also chose a random location? Are multiply mapped reads reported in the same fashion in bwa mem as they are for bwa aln/samse (that is, with one listing per query and XAs reported if less than INT)?
For simplicity's sake, I am not interested in using other aligners. I have pre-trimmed RAD data (80-139bp) from a frog species with a highly repetitive genome.
In general, short read aligners take some liberties in the way they report the alignments - as it turns out bwa is no different.
Alas in most cases there is surprisingly little information in the docs on the specifics on what gets reported in which way. The following is
my best understanding of how bwa mem chooses to report alignments:
When a read matches in its entirety, with an equal score in multiple locations, one of the locations is picked at random, is labeled as primary, will be given a mapping quality of zero and will have an XA tag that contains the alternative locations (this is identical to how bwa aln worked)
When different, non-overlapping regions of a read align with high scores to different, non-linear locations in the genome, the higher score alignment will be labeled as primary, the others may be reported as secondary alignments. There is some threshold on how many of these secondary alignments will be reported (bwa aln did not produce secondary alignments)
When complementary regions of a read (the pieces add up to the full read) align to different, non-linear genomic locations, with no little to no overlap, one of the alignments will be labeled as primary, the others as supplementary alignments (bwa aln did not produce supplementary alignments)
There are some options to filter for hits, though behind the scenes may be other tacit and undisclosed assumptions as to how short of an alignment it can find, regardless of the score. My guess that the mem method itself won't work under ~ 20bp.
-T INT minimum score to output [30]
-h INT[,INT] if there are <INT hits with score >80% of the max score, output all in XA [5,200]
-a output all alignments for SE or unpaired PE
You made a good point above. Even with well written programs (like bwa) it is not always possible to know the entire truth (for an end-user). Much as I like bbmap I discovered the other day that it does not seem to write all alignments for multi-mapping reads to output file (again a decision that may have been made to keep the file size sane).
The essential thing to remember with all short read aligners that they were never designed to be "optimal" from any point of view.
This is why these algorithms are so mindbogglingly fast. They use various shortcuts and sacrifice optimality to be able to achieve their primary use case - finding one, the best single hit if one such hit exists. That's why they can align tens of thousands of reads per second against the human genome.
If you want to find near-exact matches, with few alternatives, the algorithms work very well.
For any other use case, the results are approximations with a varying range of errors.
When using a tool like reformat.sh from BBMap, would the secondary alignments (i.e. when it maps to another spot in the genome) be included in the unmappedonly=t flag?
Oh ok cool, so that means both the reads would be considered mapped. Awesome, I was trying to extract the unmapped reads to figure out what was going on with my assembly.
When a read matches in its entirety, with an equal score in multiple
locations, one of the locations is picked at random, is labeled as
primary, will be given a mapping quality of zero and will have an XA
tag that contains the alternative locations (this is identical to how
bwa aln worked)
My suggestion to complement the other answer(s) is to prepare some test data and see what bwa does. For example, create a dummy genome of few hundred bases where two or more regions are exactly duplicated. Then prepare a fastq file with one or more reads matching (part of) the duplicated region. Map and see what happens. You can play with the number of mismatches, base qualities, read length etc.
Of course, you don't get a definitive answer since you can't cover all cases, but at least you can tell for sure how bwa behaves with known cases. Also, this approach forces you to think what it is you want to test. (Essentially, this kind of approach is similar to test driven development).
One challenge with this approach is that one needs to understand what they are trying to capture - possibly before knowing that.
The information encoded in sequences (genomes) is very rich and it is difficult to capture the characteristics of each problem. A method that works well for uniform base distribution may perform inadequately if the genome has internal patterns, repeating motifs etc.
Do you know if
bwa mem
reports this for all possible matches for a read or only a default number?There are some options to filter for hits, though behind the scenes may be other tacit and undisclosed assumptions as to how short of an alignment it can find, regardless of the score. My guess that the
mem
method itself won't work under ~ 20bp.Thanks.
You made a good point above. Even with well written programs (like
bwa
) it is not always possible to know the entire truth (for an end-user). Much as I likebbmap
I discovered the other day that it does not seem to write all alignments for multi-mapping reads to output file (again a decision that may have been made to keep the file size sane).Thank you for the helpful reply! I was definitely surprised by how little information I could find about multiply mapped reads in the official docs.
The essential thing to remember with all short read aligners that they were never designed to be "optimal" from any point of view.
This is why these algorithms are so mindbogglingly fast. They use various shortcuts and sacrifice optimality to be able to achieve their primary use case - finding one, the best single hit if one such hit exists. That's why they can align tens of thousands of reads per second against the human genome.
If you want to find near-exact matches, with few alternatives, the algorithms work very well.
For any other use case, the results are approximations with a varying range of errors.
When using a tool like
reformat.sh
from BBMap, would the secondary alignments (i.e. when it maps to another spot in the genome) be included in theunmappedonly=t
flag?Not if you set
primaryonly=t
. Default isprimaryonly=f
.Oh ok cool, so that means both the reads would be considered mapped. Awesome, I was trying to extract the unmapped reads to figure out what was going on with my assembly.
Does anyone know where:
happens in the bwa source code? https://github.com/lh3/bwa/blob/9f26bfcc7780753129b60717ecab0ebba6f04b7c/bwamem.c#L521 was showing promise, but I can't figure out where exactly the random selection occurs.