I am using BBMAP to align to a reference genome. I am allowing secondary hits because it is my intention to evaluate the mapping quality of those hits to identify and discard query sequences that produce multiple hits with MAPQs above a certain threshold compared to the top hit (i.e. I only want to keep reads with top hits sufficiently dissimilar to secondary hits). I've tried a number of variations on this cmdline:
bbmap.sh trimq=13 mintrimlength=100 minaveragequality=15 outm=test.aligned.sam outu=test.unaligned.sam in=test.fq ref=MAIZE_B73_REF_5_GENOME secondary=t maxsites=2 k=8 ambiguous=all
test.fq contains two sequences for illustration.
the SAM file produced (minus header)
K00333:82:HTWHKBBXX:1:1101:22597:1543 1:N:0:TGCAACCGGTCC 16 MAIZE_B73_REF_5_GENOME|3 87398937 44 150=
* 0 0 AAAAGGCGAAGAAGCTCCTAAGGGAGGCATACAGCGAAAAATAAACGCTGATACCGCCGAAATAGAAGGCGAAGAAGCTCCTAAGGGAGGCTTGCATCAAAAAGTCAGCGCTGATCTTCGGCACAAATATCATTTTGCATAACATCACAT JJJJJJJJJJJJJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJJJAJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJJJJJJJFAFAA NM:i:0 AM:i:44 NH:i:1
K00333:82:HTWHKBBXX:1:1101:15057:1806 1:N:0:TGCAACCGGTCC 0 MAIZE_B73_REF_5_GENOME|4 45893970 40 150=
* 0 0 ATCCGGGGGCTCGGAAGGTGGAAAGACACGATGCATAAGGGAGCACGAAGACGTGGTCGCCTTTCAAAGGGGTCGCCCTCCTTTTAAAGGCGACTCTCCCTACTCGCGTCCCCAACCGTCGTGGGCTGAGTCTTCTCCAACACGCTCCAA AAFFFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJFJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ NM:i:0 AM:i:40 NH:i:2
K00333:82:HTWHKBBXX:1:1101:15057:1806 1:N:0:TGCAACCGGTCC 256 MAIZE_B73_REF_5_GENOME|3 115620153 38 52=1X14=1X82= * 0 0 * * NM:i:2 AM:i:38 NH:i:2
This seems to indicate that secondary hits are being found (K00333:82:HTWHKBBXX:1:1101:15057:1806) but not for K00333:82:HTWHKBBXX:1:1101:22597:1543
If I then take K00333:82:HTWHKBBXX:1:1101:22597:1543 and blast it against this reference I get many hits. For example
Top blast hit
> MAIZE_B73_REF_5_GENOME|3 SRS link
Length=235667834
Score = 278 bits (150), Expect = 5e-73
Identities = 150/150 (100%), Gaps = 0/150 (0%)
Strand=Plus/Minus
Query 1 ATGTGATGTTATGCAAAATGATATTTGTGCCGAAGATCAGCGCTGACTTTTTGATGCAAG 60
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct 87399086 ATGTGATGTTATGCAAAATGATATTTGTGCCGAAGATCAGCGCTGACTTTTTGATGCAAG 87399027
Query 61 CCTCCCTTAGGAGCTTCTTCGCCTTCTATTTCGGCGGTATCAGCGTTTATTTTTCGCTGT 120
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct 87399026 CCTCCCTTAGGAGCTTCTTCGCCTTCTATTTCGGCGGTATCAGCGTTTATTTTTCGCTGT 87398967
Query 121 ATGCCTCCCTTAGGAGCTTCTTCGCCTTTT 150
||||||||||||||||||||||||||||||
Sbjct 87398966 ATGCCTCCCTTAGGAGCTTCTTCGCCTTTT 87398937
Second best blast hit
> MAIZE_B73_REF_5_GENOME|10 SRS link
Length=150982314
Score = 257 bits (139), Expect = 6e-67
Identities = 145/148 (98%), Gaps = 0/148 (0%)
Strand=Plus/Plus
Query 1 ATGTGATGTTATGCAAAATGATATTTGTGCCGAAGATCAGCGCTGACTTTTTGATGCAAG 60
||||||||||||||||||||||||||||||||||||||||||||||||||| | ||||||
Sbjct 15816931 ATGTGATGTTATGCAAAATGATATTTGTGCCGAAGATCAGCGCTGACTTTTCGCTGCAAG 15816990
Query 61 CCTCCCTTAGGAGCTTCTTCGCCTTCTATTTCGGCGGTATCAGCGTTTATTTTTCGCTGT 120
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
Sbjct 15816991 CCTCCCTTAGGAGCTTCTTCGCCTTCTATTTCGGCGGTATCAGCGTTTATTTTTCGCTGT 15817050
Query 121 ATGCCTCCCTTAGGAGCTTCTTCGCCTT 148
| ||||||||||||||||||||||||||
Sbjct 15817051 AAGCCTCCCTTAGGAGCTTCTTCGCCTT 15817078
So my question is, is there someway to parameterize BBMAP to allow it to find such other sequences in the reference or, if not, is there another short read aligner that would and produce meaningful MAPQ scores that I could use for my filtering?
Thanks
Generally mapq scores are set to zero when a read multi-maps. Since there is no standard of how these scores are assigned it is hard to generalize how each aligner uses them.
I suggest that you just map without additional parameters like
trimq=13 mintrimlength=100 minaveragequality=15 maxsites=2
(I don't use them personally) and see what happens. I recall thatbbmap
does not report every alignment if a read multi-maps (beyond a certain limit). @Brian may have done that conciously to keep the size of the alignment files in check.