I’ve been trying to map 20bp reads using bwa v0.7.17-r1188 but cannot get it to work with the entire zebrafish genome. The same steps work ok when I create a shorter test sample sequence of 2'000'000 nt.
Steps to reproduce (just the 10th chromosome to speed up debugging):
# cat reads.fa
> TTTATTTCCACACTTCATGG chr10:10367146-10367165:, has a 3bp mismatch read on the same chromosome (- strand): TTTATTTCCACATATGATGG/TTTATTTCCACA**T*ATGG chr10:25378199-25378221
TTTATTTCCACACTTCATGG
$ wget -O chr10.fa.gz ftp://ftp.ensembl.org/pub/release-85/fasta/danio_rerio/dna/Danio_rerio.GRCz10.dna.chromosome.10.fa.gz
$ gunzip chr10.fa.gz
$ ./bwa
Program: bwa (alignment via Burrows-Wheeler transformation)
Version: 0.7.17-r1188
..
$ ./bwa index chr10.fa
$ ./bwa aln -o 0 -n 3 -k 3 -N -l 20 chr10.fa reads.txt | ./bwa samse -n 3 chr10.fa - reads.txt
# returns just the exact match despite the -N flag:
[bwa_aln_core] print alignments... 0 10 10367146 2 20M * 0 0 TTTATTTCCACACTTCATGG * XT:A:U NM:i:0 X0:i:1 X1:i:135 XM:i:0 XO:i:0 XG:i:0 MD:Z:20
If I cut the fasta file to about a few million nt then bwa does find both reads:
MAXDIFF=3
echo '> chr10 sample' > cut.fa
sed -e 1d chr10.fa | tr -d '\n' | cut -c10000000-11000000,24000000-26000000 | fold -w 60 >> cut.fa
./bwa index cut.fa
./bwa aln -0 -o 0 -n $MAXDIFF -k $MAXDIFF -N -l 20 cut.fa reads.txt | ./bwa samse -n $MAXDIFF cut.fa - reads.txt
# outputs
[bwa_aln_core] print alignments... 0 chr10 367147 23 20M * 0 0 TTTATTTCCACACTTCATGG * XT:A:U NM:i:0 X0:i:1 X1:i:1 XM:i:0 XO:i:0 XG:i:0 MD:Z:20XA:Z:chr10,-2378204,20M,3;
But this doesn't find both reads when MAXDIFF is 4!?
With MAXDIFF=3 it seems to work as until the distance between them is <12'000'000nt (-c10000000-11000000,13500000-26000000) and then it stops finding the other read at -c10000000-11000000,13000000-26000000.
Is it possible to make bwa search the entire sequence and find all reads with up to 4 mismatches (3 or 2 or 1 included)?
p.s. I did write to the bwa mailing list 2 days ago, so far no answer..
UPDATE 1:
to cross check bwa, here's what bowtie1 finds on the 10th chromosome for TTTATTTCCACACTTCATGG:
- 0 reads with 1 mismatch
- 0 reads with 2 mismatches
3 reads with 3 mismatches:
TTTATTTCCACACTTCATGG + 10 43867138 TTTATTTCCACACTTCATGG IIIIIIIIIIIIIIIIIIII 0 7:T>C,8:G>C,9:T>A TTTATTTCCACACTTCATGG + 10 38529200 TTTATTTCCACACTTCATGG IIIIIIIIIIIIIIIIIIII 0 1:C>T,3:T>A,19:C>G TTTATTTCCACACTTCATGG + 10 8527478 TTTATTTCCACACTTCATGG IIIIIIIIIIIIIIIIIIII 0 13:A>T,18:C>G,19:A>G
So even here bwa fails to finds all 3 reads with 3 mismatches on a 45 million nucleotides sequence.
UPDATE 2:
just did a brute force search for all 3nt mismatches on the 10th chromosome and indeed, there're 3 of them (TTTATTTCCACATATGATGG, TTTATTTCCACATTTGAAGG, TTTATTTCCACACATCATCA), so this definitely rules out the possibility there's so many of them and bwa decides not to report any.
Perhaps with the whole genome it finds so many alignments of 20 bp and 3 "differences" (mismatches, gaps) that such alignment is not considered. Do you have an upper bound for number of mappings for each read? Perhaps you're excluding those reads that map more than X times.
I want them all, and that's what -N flag should provide, unless I'm mistaken? This is an interesting hypothesis, just to test it quicky, I've tried bowtie on the 10th chromosome and it finds 0 reads with 1 mismatch, 0 reads with 2 mismatches and 3 reads with 3 mismatches. Unfortunately that's the upper limit for bowtie 1, doesn't allow to search for more than 3 so that's why I wanted to use bwa. I'm updating the question with this info.
Not answering your question directly but I also suspect that with 3 errors your short reads are mapping at too many locations and are perhaps getting discarded.
If you are willing to look at alternate programs then conside
bbmap.sh
from BBMap suite. It allows you to map a multi-mapping read atall
locations by usingambig=all
flag. You can also play with theminratio=
andminid=
parameters to tweak alignments.bbmap
does not allow you to specifyn
errors per read directly though.thanks for the suggestion! will definitely give it a try. However, I don't think there're that many reads with 3 mismatches. Just updated my question, bowtie reports only 3 such cases. But I can quickly do a simple brute-force check if that would rule out this option?
as i suspected, a brute force search only found 3 reads with 3 mismatches, so that rules out this hypothesis of too many hits.
Was
bbmap.sh
able to find them? Just curious.i'm only getting the perfect hit with BBmap. Seems like there's no explicit option to set the maximum number of mismatches, but relies on minimum alignment identity (minid param)? Anyways, even setting this to 0.1 only finds the perfect hit. The exact command was
bbmap.sh -Xmx16g minid=0.2 maxindel=0 strictmaxindel ref=chr10.fa ambiguous=all in=reads.fa out=reads.sam vslow
(also tried with both k=8 and k=15, same result)