Hi,
I aligned some dna-seq paired-end data using bbmap. I want to filter out the multi-mapped reads thus I used the ambiguous=toss parameter. However when I checked my alignment file in IGV I found some reads aligning on regions that are found multiple times in the genome (hg19). Here's the command I used :
bbmap.sh path=$bbmapHost local=t ambiguous=toss in1=R1.fastq in2=R2.fastq threads=1 minid=0.90 maxindel=1 mappedonly=t overwrite=t pairedonly=t sam=1.3 out=out.sam
Is there something wrong ?
Here's an example :
BLAT on the highlighted sequence in IGV shows multiple identical regions in the human genome :
http://grch37.ensembl.org/Homo_sapiens/Tools/Blast/Results?db=core;tl=gyu9eR8dd5ffOAcp-2260351
I extracted one of the read of this region of interest and align it again using ambig=toss and also ambig=all . Here are the results for ambig=toss.
M00991:78:000000000-AP8FW:1:2109:9321:20334 1:N:0:59 99 chr14 63006989 42 98M = 63007016 156 AATTTACGGATTCAATGCCATCCCCATCAAGCTACCAATGACTTTCTTCACAGAATTGGAAAAAACTACTTGAAAGTTCATATGGAACCAAAAAAGAG GHAAEHFGGFCEFHHHHG3GFHGAE2GHHHBGFGHBGFHHHHHHHHFFGHDGGHH@DFHFBGHHGGEEGHGDFBGHHHHHGFGHBFGHEGCC/FC//@ NM:i:1 AM:i:42
M00991:78:000000000-AP8FW:1:2109:9321:20334 1:N:0:59 147 chr14 63007016 44 129M = 63006989 -156 CAAGCTACCAATGACTTTCTTCACAGAATTGGAAAAAACTACTTGAAAGTTCATATGGAACCAAAAAAGAGCCCGCATCGCCAAGTCAATCCTAAGCCAAAAGAACAAAGCTGGAGGCATCACACTACC FHGCF11HHHHHD>GF0GG2BFHHHHHGDDFFGFFHGEGFHFFBHHGEBHGHFFFFHGGHGGEHHHHHHEE?CGE?ACGHHFHHGFFHHDHFCCGG?HFHGAF02HHFBABEAFHHHHHHFHGGA30HH NM:i:0 AM:i:44
and here for ambig=all. You can see the results from ambig=toss are also there in addition to other region. Thus the read should be flagged as multimapped and not be reported in the ambig=toss mode.. no ?
M00991:78:000000000-AP8FW:1:2109:9321:20334 1:N:0:59 99 chr14 63006989 42 98M = 63007016 156 AATTTACGGATTCAATGCCATCCCCATCAAGCTACCAATGACTTTCTTCACAGAATTGGAAAAAACTACTTGAAAGTTCATATGGAACCAAAAAAGAG GHAAEHFGGFCEFHHHHG3GFHGAE2GHHHBGFGHBGFHHHHHHHHFFGHDGGHH@DFHFBGHHGGEEGHGDFBGHHHHHGFGHBFGHEGCC/FC//@ NM:i:1 AM:i:42 NH:i:5
M00991:78:000000000-AP8FW:1:2109:9321:20334 1:N:0:59 147 chr14 63007016 44 129M = 63006989 -156 CAAGCTACCAATGACTTTCTTCACAGAATTGGAAAAAACTACTTGAAAGTTCATATGGAACCAAAAAAGAGCCCGCATCGCCAAGTCAATCCTAAGCCAAAAGAACAAAGCTGGAGGCATCACACTACC FHGCF11HHHHHD>GF0GG2BFHHHHHGDDFFGFFHGEGFHFFBHHGEBHGHFFFFHGGHGGEHHHHHHEE?CGE?ACGHHFHHGFFHHDHFCCGG?HFHGAF02HHFBABEAFHHHHHHFHGGA30HH NM:i:0 AM:i:44 NH:i:5
M00991:78:000000000-AP8FW:1:2109:9321:20334 1:N:0:59 369 chr10 7101478 40 98M chr14 63007016 0 * * NM:i:2 AM:i:40 NH:i:5
M00991:78:000000000-AP8FW:1:2109:9321:20334 2:N:0:59 321 chr10 7101420 43 129M chr14 63006989 0 * * NM:i:1 AM:i:43 NH:i:5
M00991:78:000000000-AP8FW:1:2109:9321:20334 1:N:0:59 369 chr10 11774921 40 98M chr14 63007016 0 * * NM:i:2 AM:i:40 NH:i:5
M00991:78:000000000-AP8FW:1:2109:9321:20334 2:N:0:59 321 chr10 11774863 43 129M chr14 63006989 0 * * NM:i:1 AM:i:43 NH:i:5
M00991:78:000000000-AP8FW:1:2109:9321:20334 1:N:0:59 369 chr10 15959217 40 98M chr14 63007016 0 * * NM:i:2 AM:i:40 NH:i:5
M00991:78:000000000-AP8FW:1:2109:9321:20334 2:N:0:59 321 chr10 15959159 43 129M chr14 63006989 0 * * NM:i:1 AM:i:43 NH:i:5
M00991:78:000000000-AP8FW:1:2109:9321:20334 1:N:0:59 369 chr10 21042081 40 98M chr14 63007016 0 * * NM:i:2 AM:i:40 NH:i:5
M00991:78:000000000-AP8FW:1:2109:9321:20334 2:N:0:59 321 chr10 21042023 43 129M chr14 63006989 0 * * NM:i:1 AM:i:43 NH:i:5
edit : check the comment for an additional example
Are those PCR duplicates (i.e. independent reads that are mapping in the same position)? If you used
ambig=toss
then one read that maps in several positions would not be included.Looking at the MAPQ and NM tag, the primary alignment is slightly better than the secondary alignments - read 1 has an edit distance of 1 and read 2 has an edit distance of 0 for the primary alignment, while read 1 has an edit distance of 2 and read 2 has an edit distance of 1 for the secondary alignment. That's more obvious with the "sam=1.4" cigar strings, but the fact that the pair's combined edit distance was 2 less for the primary than secondary alignments is why it was declared unambiguous.
I checked a little bit more in detail. Here's an alignment for a pair of reads in ambiguous=all
So chr14:63006989 is the primary alignment as the AM score is the highest. To confirm I ran BBmap using ambiguous=toss and it gives me the right alignment :
However when I go to the location of the primary alignment (chr14:63006989) and copy the exact sequence where the reads align (chr14:63006989-63007113) :
AATTTACAGATTCAATGCCATCCCCATCAAGCTACCAATGACTTTCTTCACAGAATTGGAAAAAACTACTTGAAAGTTCATATGGAACCAAAAAAGAGCCCGCATCGCCAAGTCAATCCTAAGCC
and align it on hg19 using blat I found two perfect locations :
Here's the blat output : http://grch37.ensembl.org/Homo_sapiens/Tools/Blast/Results?db=core;tl=yRMn9fQ8AuuM68UD-2286190
The second alignment is not reported in the ambiguous=all mode ? But here this read is clearly a multimap one ..
Brian could you clarify this point ?
Thank you for your time
For paired reads, the mapping score also takes into account the insert size; the score is higher the closer it is to the mean insert size. If you map those reads as single-ended rather than paired, they will all be considered ambiguous. Possibly, I should reduce the effect of insert size when determining whether a mapping is ambiguous.