Bwa Sampe Very Large Insert Size
12.6 years ago
Bioscientist ★ 1.7k

I've been running BWA sampe, and it seems to take forever to finish. The stderr output gets stuck at the "Unmapped read alignment" step, and BWA also estimates an extremely large insert size.

I read some other posts, and it seems this happens because BWA sometimes estimates an unrealistic insert size (say, up to 1 Mb), which makes unmapped mate alignment very time-consuming because it has to run Smith-Waterman over a very large window.

I'm just wondering: is this a bug, or just bad luck? Why do such errors happen? What should I do? Just re-run the program? I've run BWA many, many times, and this is the FIRST time I've ever come across this problem.

Someone also suggested using "-A" to disable insert size estimation, but unfortunately that also skips unmapped mate alignment, which is not good, because I need such "problematic" reads for further analysis.

Thanks

12.6 years ago
Drio ▴ 920

The first thing I would do is plot the insert size distribution of your dataset to see if there is anything funky going on. You may have a bimodal distribution that confuses BWA's insert size estimation.
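To make the plotting suggestion concrete, here is a minimal sketch that bins insert sizes straight from SAM text. It assumes you already have a SAM file with mapped pairs, and it reads the observed template length from column 9 (TLEN). The function name, the bin width, and the file name in the usage note are illustrative choices, not anything BWA provides.

```python
from collections import Counter


def insert_size_histogram(sam_lines, bin_width=50, max_size=None):
    """Bin TLEN (SAM column 9) for pairs where both mates are mapped.

    Returns {bin_start: pair_count}, sorted by bin_start.
    """
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@"):  # skip header lines
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:  # not a complete alignment record
            continue
        flag = int(fields[1])
        tlen = int(fields[8])
        paired = flag & 0x1
        unmapped = flag & 0x4
        mate_unmapped = flag & 0x8
        # tlen > 0 keeps only the leftmost mate, so each pair is counted once;
        # pairs on different reference sequences have TLEN 0 and are skipped.
        if not paired or unmapped or mate_unmapped or tlen <= 0:
            continue
        if max_size is not None and tlen > max_size:
            continue
        counts[(tlen // bin_width) * bin_width] += 1
    return dict(sorted(counts.items()))


def print_histogram(hist, width=60):
    """Print a crude text histogram, one row per bin."""
    if not hist:
        return
    peak = max(hist.values())
    for start, n in hist.items():
        print(f"{start:>8} {'#' * max(1, n * width // peak)} {n}")
```

For example, `print_histogram(insert_size_histogram(open("aln.sam")))`, where `aln.sam` is a placeholder for your own file. Two separate peaks in the output would point to the kind of bimodal distribution mentioned above.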

Also, what type of data is this? What type of library? What reference genome are you using? Are you using a draft genome as a reference?

Just to clarify: if BWA estimates the insert size incorrectly, it will trigger local alignment for a higher number of pairs than normal. That will increase the number of computations and ultimately the wall time of your job.

Take a look at these lines in the stderr output from sampe:

[bwa_paired_sw] 905 out of 12656 Q17 discordant pairs are fixed.
[bwa_sai2sam_pe_core] time elapses: 9.91 sec

In your case you will see a much higher number of discordant pairs and an elapsed time that is orders of magnitude higher.
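If you want to compare runs without reading the log by eye, a small script can total these numbers. This is only a sketch: the regular expressions are written against the log lines quoted in this thread, so other BWA versions may print something different, and the function name is made up for illustration.

```python
import re

# Patterns modeled on the sampe stderr lines quoted in this thread.
PAIRED_SW = re.compile(
    r"\[bwa_paired_sw\] (\d+) out of (\d+) Q17 "
    r"(singletons are mated|discordant pairs are fixed)\."
)
ELAPSED = re.compile(r"\[bwa_sai2sam_pe_core\] time elapses: ([\d.]+) sec")


def summarize_sampe_log(lines):
    """Total the 'out of N' candidate counts and elapsed seconds."""
    summary = {"singletons": 0, "discordant": 0, "seconds": 0.0}
    for line in lines:
        m = PAIRED_SW.search(line)
        if m:
            key = "singletons" if m.group(3).startswith("singletons") else "discordant"
            summary[key] += int(m.group(2))  # the N in "out of N"
            continue
        m = ELAPSED.search(line)
        if m:
            summary["seconds"] += float(m.group(1))
    return summary
```

Feed it the captured stderr of a normal run and of the slow run, then compare the `discordant` and `seconds` totals.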

Finally, I think that using -A may be reasonable, depending on the experiment you are working on.


I wrote a patch for BWA that adds a new argument to sampe (-d INT) that dynamically disables Smith-Waterman if the number of pair rescue attempts exceeds INT.

without -d:

[bwa_sai2sam_pe_core] align unmapped mate...
[bwa_paired_sw] 27969 out of 48925 Q17 singletons are mated.
[bwa_paired_sw] 1143 out of 56132 Q17 discordant pairs are fixed.
[bwa_sai2sam_pe_core] time elapses: 6839.43 sec

with -d (4000):

[bwa_sai2sam_pe_core] align unmapped mate...
[bwa_paired_sw] Too many rescue attemps, disabling Smith-Waterman for unmapped mates.
[bwa_paired_sw] 1122 out of 1868 Q17 singletons are mated.
[bwa_paired_sw] 95 out of 2133 Q17 discordant pairs are fixed.
[bwa_sai2sam_pe_core] time elapses: 261.08 sec

Do you guys find it useful?

Code here: https://github.com/drio/bwa/commit/250a88ccb30d782cefd34b4cd2f3c66bd2a023be

At the time of writing this comment, the pull request is still under review.


When the alignment has not finished yet, how can we plot the insert size distribution? I'm curious, and would be grateful for your guidance.


Run sampe with -s to disable SW.

12.6 years ago
Swbarnes2 ★ 1.6k

bwa will also do that if you made a mistake on your command line and mistyped the name of an input file in a previous step. If you mistyped the name of your FASTQ in the aln step, I think you still get a .sai file, and sampe will do its best with what it has. So that's a quick thing you can double-check.
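One way to do that double-check before launching sampe is a quick pre-flight test on the input files. This is a rough sketch, not part of BWA: it only confirms that each path exists and is not empty, so it won't catch every bad .sai file. The function name and the file names in the usage note are placeholders.

```python
import os


def check_sampe_inputs(paths):
    """Return a list of problems (missing or empty files); empty list means OK."""
    problems = []
    for path in paths:
        if not os.path.isfile(path):
            problems.append(f"missing: {path}")
        elif os.path.getsize(path) == 0:
            problems.append(f"empty: {path}")
    return problems
```

For example, `check_sampe_inputs(["reads_1.fastq", "reads_2.fastq", "reads_1.sai", "reads_2.sai"])` returns one message per file that needs a second look.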
