I have a set of paired-end FASTQ files, and I ran bwa-mem (v0.7.17-r1188) on them with exactly the same parameters, including the same number of threads, on two different computing clusters. I compared the resulting BAM files with samtools stats, and the outputs differ between the two clusters. The numbers of mapped and unmapped reads each differ by about 100. I don't think the differences are huge, but I was expecting the BAM files to be identical, so I am wondering what caused the discrepancies.
I am aware of past posts on this forum about bwa-mem producing inconsistent results with different thread counts, but in my case the number of threads is the same. I also know that bwa-mem has a fixed random number seed, which is used to break ties between equally good alignments. My hypothesis is that, on different machines, the random number generator can produce different results even with the same seed.
What do you think?
Yes, this is known behaviour. See these issues for solutions:
https://github.com/lh3/bwa/issues/192
https://github.com/lh3/bwa/issues/272
https://github.com/lh3/bwa/issues/121
Yeah, I am aware of these, but they cover different circumstances: either reordering of the reads or a different number of threads. In my case, the FASTQ files are identical (no shuffling) and the number of threads is the same.
Ah, sorry, I should have read more carefully. Anyway, I can confirm that different machines sometimes produce different output. I personally had this with quantification and analysis of single-cell RNA-seq, where running things 100% identically on a Mac and a Linux workstation caused slight differences. We eventually tracked this down to a machine-precision problem; maybe it's the same here.
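To illustrate what a machine-precision problem can look like in general, here is a minimal Python sketch. It only shows that floating-point addition depends on the order of operations; it is not a claim about where (or whether) bwa-mem does this. The helper name `sum_in_order` is made up for the example.

```python
import math

def sum_in_order(values):
    """Add the values strictly left to right, as a simple loop would."""
    total = 0.0
    for v in values:
        total += v
    return total

values = [0.1, 0.2, 0.3]
forward = sum_in_order(values)             # (0.1 + 0.2) + 0.3
backward = sum_in_order(reversed(values))  # (0.3 + 0.2) + 0.1

# The two totals agree to within rounding error but are not bit-identical.
print(forward == backward)
print(math.isclose(forward, backward))
```

If a program picks the "best" candidate by comparing such values exactly, a difference in the last bit (for example from a different compiler, math library, or CPU instruction set) can be enough to flip the choice between two near-tied candidates.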
Many NGS programs produce results that are non-deterministic. This generally isn't a problem. If you must have results that are identical then you will need to use a program like bbmap that supports a deterministic option at run-time. Brian Bushnell may have a programmer's insight on why multi-threaded runs produce non-deterministic output.
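If you want to see which reads account for the difference rather than only the summary counts, one option is to compare the primary alignment of each read between the two outputs. Below is a rough Python sketch that works on SAM text lines (for example, what `samtools view` prints). The function names and the example records are made up for illustration; only the first five SAM columns and the standard FLAG bits are used.

```python
def primary_records(sam_lines):
    """Map (read name, mate number) -> (flag, reference, position, MAPQ).

    Header lines, secondary (0x100) and supplementary (0x800) records are skipped.
    """
    records = {}
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])
        if flag & 0x100 or flag & 0x800:
            continue
        # 0x40 = first in pair, 0x80 = second in pair, 0 = unpaired/unknown
        mate = 1 if flag & 0x40 else 2 if flag & 0x80 else 0
        records[(fields[0], mate)] = (flag, fields[2], int(fields[3]), int(fields[4]))
    return records


def diff_alignments(lines_a, lines_b):
    """Return {key: (record_a, record_b)} for reads whose primary record differs."""
    a = primary_records(lines_a)
    b = primary_records(lines_b)
    return {
        key: (a.get(key), b.get(key))
        for key in sorted(a.keys() | b.keys())
        if a.get(key) != b.get(key)
    }


# Made-up records: mate 1 of read r1 is mapped in run A but unmapped in run B.
run_a = [
    "r1\t99\tchr1\t100\t60\t4M\t=\t300\t204\tACGT\tFFFF",
    "r1\t147\tchr1\t300\t60\t4M\t=\t100\t-204\tACGT\tFFFF",
]
run_b = [
    "r1\t69\tchr1\t300\t0\t*\t=\t300\t0\tACGT\tFFFF",
    "r1\t147\tchr1\t300\t60\t4M\t=\t100\t-204\tACGT\tFFFF",
]

for key, (rec_a, rec_b) in diff_alignments(run_a, run_b).items():
    print(key, rec_a, rec_b)
```

Looking at the MAPQ and flags of the reads that come out of this should tell you whether the differing reads are mostly low-confidence or multi-mapping ones, which would fit the tie-breaking explanation.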