Question

Bwa Sampe Very Large Insert Size

1


12.7 years ago

Bioscientist ★ 1.7k

I've been running BWA sampe, and it seems to take forever to finish. The stderr output gets stuck at the "Unmapped read alignment" step, and BWA also estimates an extremely large insert size.

From other posts, this seems to happen because BWA sometimes estimates an unrealistic insert size (up to 1 Mb, say), which makes unmapped mate alignment very time-consuming because it has to run Smith-Waterman over a very large window.
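To put a rough number on the cost described above, here is a back-of-the-envelope sketch. It assumes a textbook Smith-Waterman that fills one dynamic-programming cell per (read base, window base) pair; the read length and the "normal" window size are assumed values for illustration, and BWA's actual window choice and implementation may differ.

```python
def sw_cells(read_len: int, window_len: int) -> int:
    """Cells filled by one textbook Smith-Waterman alignment (O(m * n))."""
    return read_len * window_len


read_len = 100               # assumed read length in bp
normal_window = 1_000        # assumed window for a typical library, in bp
inflated_window = 1_000_000  # window if the insert size is estimated at ~1 Mb

ratio = sw_cells(read_len, inflated_window) / sw_cells(read_len, normal_window)
print(f"{ratio:.0f}x more cells per rescue attempt")
```

Under these assumptions every rescue attempt costs about a thousand times more, before counting the extra pairs that get sent to rescue in the first place.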

I'm wondering whether this is a bug or just bad luck. Why do such errors happen, and what should I do? Just re-run the program? I've run BWA many, many times, and this is the FIRST time I've come across this problem.

Someone also suggested using "-A" to disable insert size estimation, but unfortunately that also skips unmapped mate alignment, which is no good because I need those "problematic" reads for further analysis.

Thanks

bwa • 5.7k views

ADD COMMENT • link updated 12.7 years ago by Drio ▴ 920 • written 12.7 years ago by Bioscientist ★ 1.7k

score 2 · Answer 1 · 2012-04-05

2


12.7 years ago

Drio ▴ 920

The first thing I would do is plot the insert size distribution of your dataset to see if there is anything funky going on. You may have a bimodal distribution that confuses BWA's insert size estimation.

Also, what type of data is this? What type of library? What reference genome are you using? Are you using a draft genome as a reference?

Just to clarify: if BWA estimates the insert size incorrectly, it will trigger local alignment for more pairs than normal. That increases the amount of computation and, ultimately, the wall time of your job.

Take a look at these lines in the stderr output from sampe:

[bwa_paired_sw] 905 out of 12656 Q17 discordant pairs are fixed.
[bwa_sai2sam_pe_core] time elapses: 9.91 sec

In your case you will see a much higher number of discordant pairs and an elapsed time that is orders of magnitude higher.
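If you want to compare runs without eyeballing the log, a small parser along these lines could pull out the counts and timings. It is only a sketch: the regular expressions are written against the stderr lines quoted in this thread (including the "singletons are mated" line that appears in the log excerpts further down), and other BWA versions may word them differently.

```python
import re

# Patterns modelled on the sampe stderr lines quoted in this thread.
PAIRS_RE = re.compile(
    r"\[bwa_paired_sw\] (\d+) out of (\d+) Q17 "
    r"(singletons|discordant pairs) are (?:mated|fixed)\."
)
TIME_RE = re.compile(r"\[bwa_sai2sam_pe_core\] time elapses: ([\d.]+) sec")


def summarize_sampe_log(lines):
    """Collect (rescued, total) counts and elapsed seconds per batch."""
    summary = {"singletons": [], "discordant pairs": [], "seconds": []}
    for line in lines:
        match = PAIRS_RE.search(line)
        if match:
            rescued, total = int(match.group(1)), int(match.group(2))
            summary[match.group(3)].append((rescued, total))
            continue
        match = TIME_RE.search(line)
        if match:
            summary["seconds"].append(float(match.group(1)))
    return summary


example = [
    "[bwa_paired_sw] 905 out of 12656 Q17 discordant pairs are fixed.",
    "[bwa_sai2sam_pe_core] time elapses: 9.91 sec",
]
print(summarize_sampe_log(example))
```

Feeding it the stderr of a slow run and a normal run makes the jump in discordant pairs and elapsed time easy to spot.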

Finally, I think that using -A may be reasonable, depending on the experiment you are working on.

ADD COMMENT • link 12.7 years ago by Drio ▴ 920

1


I wrote a patch for BWA that adds a new argument to sampe (-d INT), which disables Smith-Waterman dynamically once the number of pair rescue attempts exceeds INT.

without -d:

[bwa_sai2sam_pe_core] align unmapped mate...
[bwa_paired_sw] 27969 out of 48925 Q17 singletons are mated.
[bwa_paired_sw] 1143 out of 56132 Q17 discordant pairs are fixed.
[bwa_sai2sam_pe_core] time elapses: 6839.43 sec

with -d (4000):

[bwa_sai2sam_pe_core] align unmapped mate...
[bwa_paired_sw] Too many rescue attemps, disabling Smith-Waterman for unmapped mates.
[bwa_paired_sw] 1122 out of 1868 Q17 singletons are mated.
[bwa_paired_sw] 95 out of 2133 Q17 discordant pairs are fixed.
[bwa_sai2sam_pe_core] time elapses: 261.08 sec

Do you guys find it useful?

Code here: https://github.com/drio/bwa/commit/250a88ccb30d782cefd34b4cd2f3c66bd2a023be

At the time of writing this comment, the pull request is still under review.

ADD REPLY • link 12.2 years ago by Drio ▴ 920

0


If the alignment hasn't finished yet, how can we plot the insert size distribution? I'm curious, and grateful for your guidance.

ADD REPLY • link 12.7 years ago by Casual ▴ 90

0


Run sampe with -s to disable SW.
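Once that run has produced a SAM file, the insert sizes can be tabulated from column 9 (TLEN). The following is a minimal sketch, not a BWA feature: the function name, the 50 bp bin width, and the toy records are all made up for illustration. It keeps only pairs where both mates are mapped to the same reference and counts each pair once, through the mate with a positive TLEN.

```python
from collections import Counter


def insert_size_histogram(sam_lines, bin_width=50):
    """Bin positive TLEN values (SAM column 9) for same-reference pairs."""
    hist = Counter()
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete alignment record
        flag = int(fields[1])
        tlen = int(fields[8])
        # Skip if the read (0x4) or its mate (0x8) is unmapped,
        # or if the mate sits on a different reference sequence.
        if flag & 0xC or fields[6] != "=":
            continue
        if tlen > 0:  # leftmost mate only, so each pair is counted once
            hist[(tlen // bin_width) * bin_width] += 1
    return hist


# Synthetic records, for illustration only.
demo = [
    "@SQ\tSN:chr1\tLN:1000000",
    "r1\t99\tchr1\t100\t60\t50M\t=\t300\t250\t*\t*",
    "r1\t147\tchr1\t300\t60\t50M\t=\t100\t-250\t*\t*",
    "r2\t73\tchr1\t500\t60\t50M\t=\t500\t0\t*\t*",
]
for start, count in sorted(insert_size_histogram(demo).items()):
    print(f"{start}-{start + 49} bp: {count}")
```

For a real file you would pass an open file handle instead of the list; a second peak or a long tail in the printed bins is the kind of thing that could throw off the estimate.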

ADD REPLY • link 12.2 years ago by Drio ▴ 920

score 0 · Answer 2 · 2012-04-05

0


12.7 years ago

Swbarnes2 ★ 1.6k

BWA will also do that if you messed up your command line and mistyped the name of an input file in a previous step. If you mistyped the name of your fastq in the aln step, I think it still produces a .sai file, and sampe will do its best with what it has. So that's a quick thing you can double-check.
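A quick way to do that double-check is to confirm that every input exists and is not empty before launching sampe. This is just a sketch with a hypothetical helper name; it only flags missing or zero-byte files and makes no assumption about what a .sai from a failed aln step actually looks like.

```python
import os


def find_bad_inputs(paths):
    """Return (path, reason) for each file that is missing or zero bytes."""
    problems = []
    for path in paths:
        if not os.path.isfile(path):
            problems.append((path, "missing"))
        elif os.path.getsize(path) == 0:
            problems.append((path, "empty"))
    return problems
```

You would call it with your own fastq and .sai paths, for example `find_bad_inputs(["reads_1.fastq", "reads_1.sai"])` (file names here are placeholders), and stop if the returned list is not empty.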

ADD COMMENT • link 12.7 years ago by Swbarnes2 ★ 1.6k