Bwa Mem Have Different Alignment Result When Using Different Threads
4
17
Entering edit mode
11.0 years ago
shl198 ▴ 440

Hi, I used bwa-mem to align paired-end reads, I trimmed one fastq file which is 80 bps long, the mate fastq file is 79 bps long. When I used 8 threads, it mapped more reads than the default 1 thread. Now I am confused, changing the number of threads are supposed to only change the speed, why does it affect the mapping result? Does any one know the reason for that? Any comments are appreciated. Thank you.

• 20k views
ADD COMMENT
2
Entering edit mode

What flags are you using? Does the behaviour persist with the -M flag? Can you identify where the differences are: i.e. reads mapped multiple times or a greater % of raw reads mapped?

ADD REPLY
0
Entering edit mode

I only set -t, for others I used default settings. Do I have to use -M all the time? I use samtools flagstat to test the result. All items are different. The biggest different different is the number of reads properly paired.

ADD REPLY
2
Entering edit mode

That could be a bug in the software, especially taking in account that race conditions are usually more difficult to debug and test - so they have more probability to get through undetected. What version of bwa do you use, and did you have a chance to see what was aligned differently between runs? I rely on bwa-mem in current project and am really interested in knowing more about this issue.

ADD REPLY
1
Entering edit mode

how many more reads did it map?

ADD REPLY
0
Entering edit mode

It mapped about 50,000 more reads.

ADD REPLY
1
Entering edit mode

Please provide a reproducible example.

ADD REPLY
0
Entering edit mode

I am using the 0.7.12 version now, and it has the same problem still!

ADD REPLY
11
Entering edit mode
11.0 years ago
Pavel Senin ★ 1.9k

I've just run a piece of my pipeline and data using different number of threads, the number is greater with more threads. Diff shows that some lines changed, some got added:

$ bwa

Program: bwa (alignment via Burrows-Wheeler transformation)
Version: 0.7.5a-r405
Contact: Heng Li <lh3@sanger.ac.uk>

The command line

bwa mem -t 1 -M dsm2059.fasta D170_R1_val_1.fq.gz D170_R2_val_2.fq.gz | samtools view -F 4 -Sbh - | samtools sort - D170.sorted_T1
bwa mem -t 2 -M dsm2059.fasta D170_R1_val_1.fq.gz D170_R2_val_2.fq.gz | samtools view -F 4 -Sbh - | samtools sort - D170.sorted_T2
bwa mem -t 3 -M dsm2059.fasta D170_R1_val_1.fq.gz D170_R2_val_2.fq.gz | samtools view -F 4 -Sbh - | samtools sort - D170.sorted_T3
bwa mem -t 4 -M dsm2059.fasta D170_R1_val_1.fq.gz D170_R2_val_2.fq.gz | samtools view -F 4 -Sbh - | samtools sort - D170.sorted_T4

The number of aligned reads is somewhat different:

$ samtools flagstat D170.sorted_T1.bam
53260 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
53260 + 0 mapped (100.00%:-nan%)
53260 + 0 paired in sequencing
26733 + 0 read1
26527 + 0 read2
32537 + 0 properly paired (61.09%:-nan%)
32934 + 0 with itself and mate mapped
20326 + 0 singletons (38.16%:-nan%)
85 + 0 with mate mapped to a different chr
65 + 0 with mate mapped to a different chr (mapQ>=5)

$ samtools flagstat D170.sorted_T2.bam
53388 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
53388 + 0 mapped (100.00%:-nan%)
53388 + 0 paired in sequencing
26797 + 0 read1
26591 + 0 read2
32862 + 0 properly paired (61.55%:-nan%)
33117 + 0 with itself and mate mapped
20271 + 0 singletons (37.97%:-nan%)
83 + 0 with mate mapped to a different chr
64 + 0 with mate mapped to a different chr (mapQ>=5)

$ samtools flagstat D170.sorted_T3.bam
53419 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
53419 + 0 mapped (100.00%:-nan%)
53419 + 0 paired in sequencing
26815 + 0 read1
26604 + 0 read2
32939 + 0 properly paired (61.66%:-nan%)
33156 + 0 with itself and mate mapped
20263 + 0 singletons (37.93%:-nan%)
87 + 0 with mate mapped to a different chr
65 + 0 with mate mapped to a different chr (mapQ>=5)

$ samtools flagstat D170.sorted_T4.bam
53442 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
53442 + 0 mapped (100.00%:-nan%)
53442 + 0 paired in sequencing
26828 + 0 read1
26614 + 0 read2
33006 + 0 properly paired (61.76%:-nan%)
33181 + 0 with itself and mate mapped
20261 + 0 singletons (37.91%:-nan%)
83 + 0 with mate mapped to a different chr
64 + 0 with mate mapped to a different chr (mapQ>=5)

That read, seems to be longer with 4 threads, maybe it has to do with seeds? enter image description here

ADD COMMENT
3
Entering edit mode

Out of curiosity, do you get identical results if you rerun one of these? I'm wondering if there's a random number generator being used somewhere for seeding, that might cause slightly different results with each run (in fact, just greping through the bwa source code, it makes use of random seeds and such in a number of places).

ADD REPLY
2
Entering edit mode

I repeated runs with 8 threads and got identical results.

ADD REPLY
0
Entering edit mode

Second to that, I run for four and three threads with the same counts.

ADD REPLY
8
Entering edit mode
10.9 years ago
shl198 ▴ 440

For those who may concern, I got answer from the developer of bwa mem. There is a bug in bwa-mem running with different threads in the version 0.7.5a. The other master branch versions may work fine.

ADD COMMENT
0
Entering edit mode

thanks, it's good to know, that the problem is acknowledged

ADD REPLY
0
Entering edit mode

Is it all 0.7.5a versions? There are several revisions to 0.7.5.a in the master branch

ADD REPLY
0
Entering edit mode

Only the released version in source forge has the bug, in github I think he fixed it. This is what he said: "There is a bug in 0.7.5a which affects the randomness. The master branch in git should not have this issue if I am right".

ADD REPLY
0
Entering edit mode

Sounds like it. That's a relief as I have been using the 0.7.5a version from GitHub in my production runs for awhile now. I am testing with the current github version to double check overnight/tomorrow

ADD REPLY
0
Entering edit mode

So, is this solved?

the developer doesn't sound certain it's been fixed. Has anyone checked?

ADD REPLY
0
Entering edit mode

I can confirm that I am having this problem with bwa version 0.7.12.

ADD REPLY
0
Entering edit mode

BWA is now up to version 0.7.17, 0.7.12 was released in December of 2014. The solution is to upgrade your BWA version

ADD REPLY
0
Entering edit mode

I've been banging my head for about a week and i've just come across this thread - glad Pavel Senin has confirmed this. I'm using bwa/0.7.15 so i doubt its fixed. P

ADD REPLY
2
Entering edit mode
5.7 years ago
GenoMax 148k

Heng Li had addressed this previously: A: BWA mem output inconsistent on same but re-ordered FASTQ input

ADD COMMENT
0
Entering edit mode

Thank you! that was my problem and now I know the reason

ADD REPLY
0
Entering edit mode
5.7 years ago

I have to say that after several years, this issue still persists, which I think is a quite big reproducibility issue. Especially when the pipeline consists of many steps, the inconsistencies accumulate on top of each other like a snowball.

I use version 0.7.17-r1188

bwa mem -t 10 SC_masked 4_SC-NA_2N_col1_28279_TCCTGAtr_1P.fastq.gz  4_SC-NA_2N_col1_28279_TCCTGAtr_2P.fastq.gz | samtools sort -@10 -o SC1.bam -

61544262 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
330 + 0 supplementary
0 + 0 duplicates
47333309 + 0 mapped (76.91% : N/A)
61543932 + 0 paired in sequencing
30771966 + 0 read1
30771966 + 0 read2
46942664 + 0 properly paired (76.28% : N/A)
47180978 + 0 with itself and mate mapped
152001 + 0 singletons (0.25% : N/A)
87678 + 0 with mate mapped to a different chr
45985 + 0 with mate mapped to a different chr (mapQ>=5)

bwa mem -t 4 SC_masked 4_SC-NA_2N_col1_28279_TCCTGAtr_1P.fastq.gz  4_SC-NA_2N_col1_28279_TCCTGAtr_2P.fastq.gz | samtools sort -@4 -o SC1.bam -

61544262 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 secondary
330 + 0 supplementary
0 + 0 duplicates
47333217 + 0 mapped (76.91% : N/A)
61543932 + 0 paired in sequencing
30771966 + 0 read1
30771966 + 0 read2
46943592 + 0 properly paired (76.28% : N/A)
47180798 + 0 with itself and mate mapped
152089 + 0 singletons (0.25% : N/A)
87692 + 0 with mate mapped to a different chr
46000 + 0 with mate mapped to a different chr (mapQ>=5)
ADD COMMENT
1
Entering edit mode

I think (but not sure) Bowtie2 is deterministic, as it doesn't estimate on the fly insert size. If you consider exactly reproducibility very important, then maybe is an aligner to consider / test.

ADD REPLY
1
Entering edit mode

From bowtie2 manual:

If you run the same version of Bowtie 2 on two reads with identical names, nucleotide strings, and quality strings, and if --seed is set the same for both runs, Bowtie 2 will produce the same output; i.e., it will align the read to the same place, even if there are multiple equally good alignments. This is intuitive and desirable in most cases.

It does not appear to do this by default.

Bowtie 2’s search for alignments for a given read is “randomized.” That is, when Bowtie 2 encounters a set of equally-good choices, it uses a pseudo-random number to choose. For example, if Bowtie 2 discovers a set of 3 equally-good alignments and wants to decide which to report, it picks a pseudo-random integer 0, 1 or 2 and reports the corresponding alignment. Arbitrary choices can crop up at various points during alignment.

ADD REPLY
0
Entering edit mode

The question is if a difference in mapping rate of < 0.001% is indeed a quite big reproducibility issue. I've never checked but one would need to see if that indeed and notably affects genome coverage and therefore downstream applications, I would assume it doesn't and these differences are probably (but this is guessing) concentrated at difficult genomic regions (low complexity) which for most downstream applications would be filtered out anyway.

ADD REPLY
0
Entering edit mode

Well, something is either reproducible or not, and in this case BWA is one of the most used software in bioinformatics, that's why I say quite big. Imagine how much time in total the community would spend to figure out what is wrong in a large pipeline, which produces slightly different results. And the parameter of cores is intuitively not the first thing to check (in my case at least) when a large pipeline behaves oddly. Regarding how much this affects the results, so in my case the downstream MACS2 and DiffBind analysis showed around 10 more open chromatin regions (~0.5% of total regions). Even if the number is not huge, I was about to redo all the analysis and plots, which might take a couple of days.

ADD REPLY
0
Entering edit mode

Well, something is either reproducible or not

That depends on what these new 10 regions are. If these are problematic regions prone to show false-positives then this is probably not too worrying. Can you upload the coordinates of the regions and the genome version. Would like to know if they are in my personal blacklists, just out of curiosity.

ADD REPLY
0
Entering edit mode

Sorry, I already rerun the pipeline with the number of cores I had before. But I work with a non-model yeast S. uvarum, here is the genome http://www.saccharomycessensustricto.org/cgi-bin/s3.cgi?data=Assemblies&version=current, so in general there are a lot of headaches with it :)

ADD REPLY
0
Entering edit mode

something is either reproducible or not

I think this is a common misconception about what reproducibility should really mean. The goal shouldn't be to get the exact same result, since that's not even possible by using deterministic software in a docker container on different systems. The goal instead should be to get within some reasonable epsilon of the same results while allowing the same conclusions to be drawn.

ADD REPLY
0
Entering edit mode

I would say that this epsilon should really depend on the area of research, while for experimental biology of course nothing can be 100% reproducible, for bioinofrmatics part I guess it should be really close to 0. Also depends on what is the conclusion, if it is a number of open-chromatin peaks, so it should be exactly the same :)

ADD REPLY
0
Entering edit mode

I guess it is how NGS aligners work (they make assumptions/programmatic compromises and you make some additional choices when you run them).

Brian Bushnell (developer of BBMap suite) had said in a past conversation :

BBMap is always nondeterministic when run in paired-end mode with multiple threads, because the insert-size average is calculated on a per-thread basis, which affects mapping; and which reads are assigned to which thread is nondeterministic. The only way to avoid that would be to restrict it to a single thread

Something similar is likely happening here.

BBMap does provide a way to run deterministically:

deterministic=f         Run in deterministic mode.  In this case it is good
                        to set averagepairdist.  BBMap is deterministic
                        without this flag if using single-ended reads,
                        or run singlethreaded.
ADD REPLY

Login before adding your answer.

Traffic: 1329 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6