BWA-SW does not find all matches for me, it seems to pick only one random best match. I have no idea how to modify the arguments to have BWA find the secondary matches.
Example:
temp.fa:
>hg38-chr6-test
AACACAGGCCGGACAGAAGCTTGGAAGGTCCTGTCTCCCCAGGGAGGAGGCCCCTGGGACAGTGTGGCTCGTGTCCTTCC
CAACGGCTCCCTCTTCCTTCCGGCTGTCGGGATCCAGGATGAGGGGATTTTCCGGTGCCAGGCAATGAACAGGAATGGAA
AGGAGACCAAGTCCAACTACCGAGTCCGTGTCTACC
This input sequence has five identical matches according to BLAT:
YourSeq 196 1 196 196 100.0% chr6_ssto_hap7 - 3499060 3499255 196
YourSeq 196 1 196 196 100.0% chr6_qbl_hap6 - 3412352 3412547 196
YourSeq 196 1 196 196 100.0% chr6_mann_hap4 - 3494156 3494351 196
YourSeq 196 1 196 196 100.0% chr6_cox_hap2 - 3622013 3622208 196
YourSeq 196 1 196 196 100.0% chr6 - 32151332 32151527 196
but just a single match with bwa-sw when I run this command:
bwa bwasw hg38.fa temp.fa
Output:
hg38_dna 16 chr6_GL000253v2_alt 3488536 0 196M * 0 0 GGTAGACACGGACTCGGTAGTTGGACTTGGTCTCCTTTCCATTCCTGTTCATTGCCTGGCACCGGAAAATCCCCTCATCCTGGATCCCGACAGCCGGAAGGAAGAGGGAGCCGTTGGGAAGGACACGAGCCACACTGTCCCAGGGGCCTCCTCCCTGGGGAGACAGGACCTTCCAAGCTTCTGTCCGGCCTGTGTT * AS:i:196 XS:i:196 XF:i:1 XE:i:0 NM:i:0
BWA versions tested: 0.7.5a-r405, 0.7.17-r1194-dirty
I tried with the "-c 0.0001" argument, makes no difference.
It has a MAPQ of 0, indicating a multimapper, which fits the BLAT result. This is normal behaviour that multimappers get a MAPQ of 0 and a random (out of all found matches) location being send to output. Is there a specific reason why you need these multimappers?
Oh wow, yes, it's indeed there, just not documented under bwasw but bwa backtrack: "Repetitive hits will be randomly chosen." manytThanks!!
Well, yeah, I need to know the best match for the input sequence and the _alt match is definitely not a very useful match. I thought that BWA knew better these days how to deal with the _alt chroms.
Non-unique matches are very common as soon as you run into chrom regions that have an _alt... I am surprised that this has not come up before somewhere else. I wonder what the normal way to deal with this is, for bwa mem...
I think you should use bwa mem: 1) I am not sure bwa sw is alt-aware, but I am sure bwa mem is, and 2) bwa mem is the recommended algorithm for reads longer than 70bp. Also check this post:
Bwa: How To Get Multiple Hits Using The Algorithm Bwasw?
In this case, my input sequence can be <= 3000 bp.... but you're right, maybe I should use BWA-mem for this? I used bwasw because the input is pretty long, by BWAMEM standards.
bwa mem can align long sequences, but minimap2 (also developed by Heng Li) is preferred for such sequences, as it is faster and more sensitive.
edit: the post Minimap2 and the future of BWA is an interesting read