how to make bwa search the entire sequence?
0
0
Entering edit mode
7.0 years ago

I’ve been trying to map 20bp reads using bwa v0.7.17-r1188 but cannot get it to work with the entire zebrafish genome. The same steps work ok when I create a shorter test sample sequence of 2'000'000 nt.

Steps to reproduce (just the 10th chromosome to speed up debugging):

# cat reads.fa
> TTTATTTCCACACTTCATGG chr10:10367146-10367165:, has a 3bp mismatch read on the same chromosome (- strand): TTTATTTCCACATATGATGG/TTTATTTCCACA**T*ATGG chr10:25378199-25378221
TTTATTTCCACACTTCATGG

$ wget -O chr10.fa.gz ftp://ftp.ensembl.org/pub/release-85/fasta/danio_rerio/dna/Danio_rerio.GRCz10.dna.chromosome.10.fa.gz
$ gunzip chr10.fa.gz
$ ./bwa
Program: bwa (alignment via Burrows-Wheeler transformation)
Version: 0.7.17-r1188
..
$ ./bwa index chr10.fa
$ ./bwa aln -o 0 -n 3 -k 3 -N -l 20 chr10.fa reads.txt | ./bwa samse -n 3 chr10.fa - reads.txt
# returns just the exact match despite the -N flag:
[bwa_aln_core] print alignments...  0   10  10367146    2   20M *   0   0   TTTATTTCCACACTTCATGG    *   XT:A:U  NM:i:0  X0:i:1  X1:i:135    XM:i:0  XO:i:0  XG:i:0  MD:Z:20

If I cut the fasta file to about a few million nt then bwa does find both reads:

MAXDIFF=3
echo '> chr10 sample' > cut.fa
sed -e 1d chr10.fa  | tr -d '\n' | cut -c10000000-11000000,24000000-26000000 | fold -w 60 >> cut.fa
./bwa index cut.fa
./bwa aln -0 -o 0  -n $MAXDIFF -k $MAXDIFF -N -l 20 cut.fa reads.txt | ./bwa samse -n $MAXDIFF cut.fa - reads.txt  

# outputs
[bwa_aln_core] print alignments...  0   chr10   367147  23  20M *   0   0   TTTATTTCCACACTTCATGG    *   XT:A:U  NM:i:0  X0:i:1  X1:i:1  XM:i:0  XO:i:0  XG:i:0  MD:Z:20XA:Z:chr10,-2378204,20M,3;

But this doesn't find both reads when MAXDIFF is 4!?

With MAXDIFF=3 it seems to work as until the distance between them is <12'000'000nt (-c10000000-11000000,13500000-26000000) and then it stops finding the other read at -c10000000-11000000,13000000-26000000.

Is it possible to make bwa search the entire sequence and find all reads with up to 4 mismatches (3 or 2 or 1 included)?

p.s. I did write to the bwa mailing list 2 days ago, so far no answer..

UPDATE 1:

to cross check bwa, here's what bowtie1 finds on the 10th chromosome for TTTATTTCCACACTTCATGG:

  • 0 reads with 1 mismatch
  • 0 reads with 2 mismatches
  • 3 reads with 3 mismatches:

    TTTATTTCCACACTTCATGG + 10 43867138 TTTATTTCCACACTTCATGG IIIIIIIIIIIIIIIIIIII 0 7:T>C,8:G>C,9:T>A TTTATTTCCACACTTCATGG + 10 38529200 TTTATTTCCACACTTCATGG IIIIIIIIIIIIIIIIIIII 0 1:C>T,3:T>A,19:C>G TTTATTTCCACACTTCATGG + 10 8527478 TTTATTTCCACACTTCATGG IIIIIIIIIIIIIIIIIIII 0 13:A>T,18:C>G,19:A>G

So even here bwa fails to finds all 3 reads with 3 mismatches on a 45 million nucleotides sequence.

UPDATE 2:

just did a brute force search for all 3nt mismatches on the 10th chromosome and indeed, there're 3 of them (TTTATTTCCACATATGATGG, TTTATTTCCACATTTGAAGG, TTTATTTCCACACATCATCA), so this definitely rules out the possibility there's so many of them and bwa decides not to report any.

bwa alignment • 2.0k views
ADD COMMENT
1
Entering edit mode

Perhaps with the whole genome it finds so many alignments of 20 bp and 3 "differences" (mismatches, gaps) that such alignment is not considered. Do you have an upper bound for number of mappings for each read? Perhaps you're excluding those reads that map more than X times.

ADD REPLY
0
Entering edit mode

I want them all, and that's what -N flag should provide, unless I'm mistaken? This is an interesting hypothesis, just to test it quicky, I've tried bowtie on the 10th chromosome and it finds 0 reads with 1 mismatch, 0 reads with 2 mismatches and 3 reads with 3 mismatches. Unfortunately that's the upper limit for bowtie 1, doesn't allow to search for more than 3 so that's why I wanted to use bwa. I'm updating the question with this info.

ADD REPLY
2
Entering edit mode

Not answering your question directly but I also suspect that with 3 errors your short reads are mapping at too many locations and are perhaps getting discarded.

If you are willing to look at alternate programs then conside bbmap.sh from BBMap suite. It allows you to map a multi-mapping read at all locations by using ambig=all flag. You can also play with the minratio= and minid= parameters to tweak alignments. bbmap does not allow you to specify n errors per read directly though.

ADD REPLY
0
Entering edit mode

thanks for the suggestion! will definitely give it a try. However, I don't think there're that many reads with 3 mismatches. Just updated my question, bowtie reports only 3 such cases. But I can quickly do a simple brute-force check if that would rule out this option?

ADD REPLY
0
Entering edit mode

as i suspected, a brute force search only found 3 reads with 3 mismatches, so that rules out this hypothesis of too many hits.

ADD REPLY
0
Entering edit mode

Was bbmap.sh able to find them? Just curious.

ADD REPLY
0
Entering edit mode

i'm only getting the perfect hit with BBmap. Seems like there's no explicit option to set the maximum number of mismatches, but relies on minimum alignment identity (minid param)? Anyways, even setting this to 0.1 only finds the perfect hit. The exact command was bbmap.sh -Xmx16g minid=0.2 maxindel=0 strictmaxindel ref=chr10.fa ambiguous=all in=reads.fa out=reads.sam vslow (also tried with both k=8 and k=15, same result)

ADD REPLY

Login before adding your answer.

Traffic: 2588 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6