Question

when bwa mem run with different thread, output.sam is different.

1

Entering edit mode

8.2 years ago

Peavil ▴ 10

Hi,

I run bwa-mem with same sample, different number of thread.

and then, output file(sam) is different.

my bwa version is 0.7.13,

I see the bwa notice (https://github.com/lh3/bwa/blob/master/NEWS.md)

Release 0.7.13 (23 Feburary 2016)

Fixed a potential bug in the multithreading mode. It may occur when mapping is much faster than file reading, which should almost never happen in practice.

Why they are still different?

thread running data(samtools flagstat)

150380122 + 0 in total (QC-passed reads + QC-failed reads)

0 + 0 secondary

145716 + 0 supplementary

0 + 0 duplicates

150143022 + 0 mapped (99.84% : N/A)

150234406 + 0 paired in sequencing

75117203 + 0 read1

75117203 + 0 read2

148053330 + 0 properly paired (98.55% : N/A)

149791510 + 0 with itself and mate mapped

205796 + 0 singletons (0.14% : N/A)

1519238 + 0 with mate mapped to a different chr

1330739 + 0 with mate mapped to a different chr (mapQ>=5)

no-thread running data(samtools flagstat)

150380124 + 0 in total (QC-passed reads + QC-failed reads)

0 + 0 secondary

145718 + 0 supplementary

0 + 0 duplicates

150143026 + 0 mapped (99.84% : N/A)

150234406 + 0 paired in sequencing
75117203 + 0 read1

75117203 + 0 read2

148053228 + 0 properly paired (98.55% : N/A)

149791516 + 0 with itself and mate mapped

205792 + 0 singletons (0.14% : N/A)

1519274 + 0 with mate mapped to a different chr

1330758 + 0 with mate mapped to a different chr (mapQ>=5)

bwa mem sequencing alignment • 2.9k views

ADD COMMENT • link updated 8.2 years ago by GenoMax 150k • written 8.2 years ago by Peavil ▴ 10

score 2 · Answer 1 · 2017-01-09

2

Entering edit mode

8.2 years ago

Brian Bushnell 20k

I'm not sure about bwa, but BBMap tracks the average insert size independently in each thread, to reduce the penalty for communicating between threads. As a result, BBMap has slightly non-deterministic output when run multithreaded with paired reads, because the decision of whether or not reads are properly paired - which also affects the decision of where to best place each read in a pair - is affected by the average insert size, which changes as the program runs. This is a very difficult problem to solve optimally in a multithreaded context without running the program multiple times, or something similarly slow, and may be generally impossible to solve in any context for an algorithm that uses insert-size statistics to improve accuracy (and is thus dependent on the output of a prior run). So, I believe your slightly non-deterministic results are by design rather than the result of a bug. It's likely that if you reorder the input by shuffling it, it will give similarly different results even in singlethreaded mode (BBMap will).

In practice, this does not generally lead to noticeable negative effects, because the difference is very small and only affects reads that map ambiguously, typically with low quality, such that the mapper can't really place them accurately.

ADD COMMENT • link 8.2 years ago by Brian Bushnell 20k

0

Entering edit mode

First, thanks for your information.

I'm confused, if a different data are low quality, shouldn't it filtered on pipeline like 'base recalibrator'?

when I run Mutect2 with two different mapping data from one sample, they resulted slightly different number of 'PASS' filtered mutation from Mutect2 output.

is it no problem in practice~?

thanks again. (I am bioinfo newbie. sorry about basic question.)

ADD REPLY • link 8.2 years ago by Peavil ▴ 10

2

Entering edit mode

Not necessarily. It depends on the variant-caller. There will always be some borderline cases. Let's say any variant with quality>20 passes filter, and variants with quality<20 fail filter. A single read mapped to a different place might change a variant from quality 20.07 to quality 19.91. So it goes from PASS to FAIL. It's still the same variant, and the evidence is basically the same.

You have to understand that you are using programs as a proxy for your intelligence. You could, of course, examine all 20 million variants individually, and if you were perfect, you'd do a better job than the variant-caller. But, in practice, you'd get bored and make mistakes (and it would take forever). So you are using variant-callers as a shortcut. As such, it's a good idea to study the variant calls, to be able to determine which ones are obviously true, and which are obviously false. With enough experience, your knowledge will be better than the hard-coded rules of current variant-callers. At that point, you can just set a somewhat lower threshold and manually examine all borderline cases to determine which are real and which are not. I guess, a variant-caller should be considered as more like triage than diagnosis.