bwa-mem reproducibility
13 months ago
scsc185 ▴ 80

I have a set of paired-end FASTQ files, and I ran bwa-mem (v0.7.17-r1188) on them with exactly the same parameters, including the same number of threads, on two different computing clusters. I compared the resulting BAM files with samtools stats, and the outputs differ between the two clusters. The number of mapped reads differs by about 100 reads, and so does the number of unmapped reads. I don't think the differences are huge, but I was expecting the BAM files to be identical, so I am wondering what caused the discrepancies.
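
Summary counts do not show which reads changed, so a record-by-record comparison may help narrow things down. Below is a minimal sketch, assuming both BAM files have been converted to SAM text (for example with `samtools view`) and that comparing only primary alignments is enough. The function names `primary_placements` and `differing_reads` and the tiny example records are hypothetical, made up for illustration.

```python
from typing import Dict, Iterable, List, Tuple

Key = Tuple[str, int]                        # (read name, mate number: 0, 1 or 2)
Placement = Tuple[int, str, int, int, str]   # (flag, rname, pos, mapq, cigar)


def primary_placements(sam_lines: Iterable[str]) -> Dict[Key, Placement]:
    """Map each primary alignment record to the fields that describe its placement."""
    out: Dict[Key, Placement] = {}
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue  # skip blank lines and header lines (@HD, @SQ, @PG, ...)
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])
        if flag & 0x900:
            continue  # skip secondary (0x100) and supplementary (0x800) records
        mate = 1 if flag & 0x40 else 2 if flag & 0x80 else 0
        out[(fields[0], mate)] = (flag, fields[2], int(fields[3]), int(fields[4]), fields[5])
    return out


def differing_reads(sam_a: Iterable[str], sam_b: Iterable[str]) -> List[Key]:
    """Return the reads whose primary placement differs between two runs."""
    a = primary_placements(sam_a)
    b = primary_placements(sam_b)
    return sorted(k for k in a.keys() | b.keys() if a.get(k) != b.get(k))


# Made-up records: read r2 (first in pair) is unmapped in run A but mapped in run B.
run_a = [
    "@PG\tID:bwa\tPN:bwa",
    "r1\t99\tchr1\t100\t60\t4M\t=\t300\t204\tACGT\tFFFF",
    "r2\t77\t*\t0\t0\t*\t*\t0\t0\tACGT\tFFFF",
]
run_b = [
    "@PG\tID:bwa\tPN:bwa",
    "r1\t99\tchr1\t100\t60\t4M\t=\t300\t204\tACGT\tFFFF",
    "r2\t73\tchr2\t500\t0\t4M\t=\t500\t0\tACGT\tFFFF",
]
print(differing_reads(run_a, run_b))
```

Looking at the reads this reports (for instance their mapping quality, or whether they have several equally good hits) may indicate whether the differences come from tie-breaking or from something else.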

I am aware of past posts on this forum about bwa-mem producing inconsistent results with different numbers of threads, but in my case the thread count is the same. I also know that bwa-mem has a fixed random number seed, which is used to break ties between equally good alignments. My explanation is that, given the same seed, the random number generator can produce different results on different computers.
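
To make that hypothesis concrete, here is a toy sketch of how a fixed seed only guarantees identical tie-breaking when the generator implementation is also identical. It does not reflect how bwa-mem actually draws random numbers: the two linear congruential generators, their constants, the seed value, and the names `make_lcg` and `break_ties` are all assumptions chosen for illustration.

```python
from typing import Callable, List


def make_lcg(multiplier: int, increment: int, seed: int) -> Callable[[], int]:
    """Return a generator of 15-bit pseudo-random integers with 32-bit state."""
    state = seed

    def next_value() -> int:
        nonlocal state
        state = (state * multiplier + increment) % 2**32
        return (state >> 16) & 0x7FFF

    return next_value


def break_ties(rng: Callable[[], int], n_candidates: int, n_reads: int) -> List[int]:
    """For each read, pick one of n_candidates equally good alignments."""
    return [rng() % n_candidates for _ in range(n_reads)]


SEED = 11  # arbitrary fixed seed, the same for every "machine"

# Two hypothetical machines whose libraries implement the generator differently.
picks_a = break_ties(make_lcg(1103515245, 12345, SEED), n_candidates=2, n_reads=20)
picks_a_again = break_ties(make_lcg(1103515245, 12345, SEED), n_candidates=2, n_reads=20)
picks_b = break_ties(make_lcg(214013, 2531011, SEED), n_candidates=2, n_reads=20)

print(picks_a == picks_a_again)  # same implementation, same seed
print(picks_a == picks_b)        # different implementation, same seed
```

The first comparison prints `True` and the second prints `False`: with the same seed, the choices are reproducible on one implementation but not across the two.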

What do you think?

bam alignment bwa-mem bwa • 1.3k views

Yeah, I am aware of these, but they describe different circumstances: either the reads were reordered or a different number of threads was used. In my case, the FASTQ files are identical (no shuffling) and the number of threads is the same.


Ah, sorry, I should have read more carefully. Anyway, I can confirm that different machines sometimes produce different output. I personally ran into this with quantification and analysis of single-cell RNA-seq data, where running exactly the same workflow on a Mac and on a Linux workstation gave slightly different results. We eventually tracked this down to a machine precision problem; maybe it's the same here.
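
As a small illustration of how machine precision can matter, here is a sketch assuming only standard IEEE 754 double-precision arithmetic. Floating-point addition is not associative, so a compiler or library that evaluates the same sum in a different order can produce a result that differs in the last bit, and that can be enough to flip a threshold decision. The cutoff value is made up, and this is not a claim about where such a difference would arise in bwa-mem.

```python
values = [0.1, 0.2, 0.3]

# The same three numbers, summed in two different orders.
left_to_right = (values[0] + values[1]) + values[2]
right_to_left = values[0] + (values[1] + values[2])

print(left_to_right, right_to_left)    # 0.6000000000000001 0.6
print(left_to_right == right_to_left)  # False

# A hypothetical score cutoff: the last-bit difference changes the outcome.
CUTOFF = 0.6
print(left_to_right > CUTOFF)  # True
print(right_to_left > CUTOFF)  # False
```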


Many NGS programs produce results that are non-deterministic. This generally isn't a problem. If you must have results that are identical then you will need to use a program like bbmap that supports a deterministic option at run-time.

Brian Bushnell may have a programmer's insight into why multi-threaded runs produce non-deterministic output.

13 months ago
dsull ★ 7.1k

Have you tried running bwa-mem on the same computing cluster twice? It is possible that the random number generator that bwa-mem uses produces different results on different machines or compilers.

Also, I don't know the bwa-mem source code, but runs can differ slightly even with the same number of threads. Let's say you have two pieces of data, A and B, and each is processed by a different thread. Sometimes A will finish processing first; other times B will finish first, because the two pieces of data are processed asynchronously.


I have not tried running it twice on the same computing cluster, but I will. Regarding your second point, how does the finishing order of the threads affect the final results? In the end, the outputs from all the threads are combined, so why does it matter which thread finishes first?


It matters because the order in which the thread outputs are combined matters. Programs often use some heuristic (e.g. they use information from the first blocks of data, or from the most recent blocks of data, to do an "update" based on the output of the current block of data).

As was alluded to previously, this is why reordering of the reads makes a difference. Back to multithreading: it is possible that the chunk #2 buffer has completed processing before the chunk #1 buffer (hence, intrinsically reordering the reads to some extent).

As a developer of multithreaded read mapping software, I can tell you that this is certainly something that can happen.
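
Here is a toy model of that effect. It is not taken from bwa-mem's source; the `classify` function, the insert-size values, the starting estimate, and the tolerance are all hypothetical. Each chunk is judged against an estimate carried over from whichever chunk was processed just before it, so swapping the processing order of two chunks changes the call for one read pair even though the input is identical.

```python
from typing import Dict, List, Sequence, Tuple

Chunk = Tuple[int, List[int]]  # (chunk id, observed insert sizes)


def classify(chunks: Sequence[Chunk], tolerance: int = 50) -> Dict[Tuple[int, int], bool]:
    """Flag each pair as 'proper' using an estimate updated after every chunk."""
    expected = 300.0  # hypothetical starting estimate of the insert size
    calls: Dict[Tuple[int, int], bool] = {}
    for chunk_id, sizes in chunks:
        for i, size in enumerate(sizes):
            calls[(chunk_id, i)] = abs(size - expected) <= tolerance
        # Heuristic update: the next chunk is judged against this chunk's mean.
        expected = sum(sizes) / len(sizes)
    return calls


chunk1: Chunk = (1, [300, 310, 290])
chunk2: Chunk = (2, [340, 360, 350])

in_order = classify([chunk1, chunk2])  # chunk 1 finishes first
swapped = classify([chunk2, chunk1])   # chunk 2 finishes first

changed = sorted(k for k in in_order if in_order[k] != swapped[k])
print(changed)  # [(1, 2)]: third pair of chunk 1 gets a different call
```

A program can avoid this by always consuming chunks in input order regardless of which thread finishes first, at the cost of some waiting.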


It clicks with me when you say "intrinsically reordering the reads." It makes sense, thank you!
