Hi all,
first all, I am new in bioinformatics, I came from the High Performance Computing world. I am trying the bwa tool to check it performance and I am facing the following problem. If I execute bwa mem paired in sequential mode (this is, without the -t option), then later I execute the program again with the same input and with 8 threads ( -t 8 ) the resulting sam file has a lot of differences with the sequential one. These are all the steps I did.
1.- Download the reference from: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/technical/reference/human_g1k_v37.fasta.gz
2.- Create the index with bwa index -a bwtsw
3.- Download the reads from ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data/NA12750/sequence_read/ERR000589_1.filt.fastq.gz and ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data/NA12750/sequence_read/ERR000589_2.filt.fastq.gz
4.- Execute bwa in sequential mode with bwa mem Reference ERR000589_1.filt.fastq ERR000589_2.filt.fastq > ExitSequential.sam (I previously unzipped the input files)
5.- Same as in point 4 but adding the -t 8 option
6.- I made a diff between the exits and I found a lot of differences in the FLAG, MAPQ, RNEXT and PNEXT fields of the sam files
Can somebody please help me with this? Thanks in advance.
The bwa version is 0.7.12-r1039 from github
This sort of behaviour has been reported on and off for the past year or so (see this thread: Bwa Mem Have Different Alignment Result When Using Different Threads ). Keep in mind that if you compare multimapping alignments between the two settings that you shouldn't necessarily expect the same results.
According to the internet, in case of equally good multiple alignments, BWA will pick an alignment randomly. Thus, you wouldn't probably get identical results from two sequential runs either, unless maybe there's a fixed seed for every thread.. I don't know.
I've read the post Devon Ryan wrote. But they talk about a bug in the 0.7.5a version that is already solved, and other people said that they have the same results in the sequential and threaded version. I think these people tried it with a small quantity of reads and not with a big data entry.
In the case that 5heikki said, I've executed the sequential version several times and I've obtained the exact same output. This is because bwa inits always with the same seed (I think it is a 11) and the randomly selected alignment is always the same because the init seed does not change.
This could be the reason why the results are different (each thread inits its own random number generator with the same seed, and after that the random numbers are different in the sequential version and in each thread, I don't know if I am explaining this well), but I am not sure of this.
Exactly, the question is whether the RNG is reinitialized per thread, per alignment, or per run. If it's initialized with a new seed for each alignment then the results should be the same regardless of thread number (presuming the seed is some function of the alignment). In the other two cases, however, different results would be produced.
If you filter out the multimappers, do you still obtain discordant alignments between the two runs?
Yes, if I filter with
samtools view -q1
I still have discordances. Here goes a discordant alignment example:That's not a different alignment, its mate is just aligned differently. Is the mate multimapped?
No, it is not multimapped.
One of the entries is from the sequential version, another one from one launched with 16 threads.
That's not actually the question I asked. I realize that you showed an alignment of the same read from the single and multithreaded run. The question pertains to that reads' mate in each run. The alignment you showed is the same between each run. The differences pertain only to its mate. So, if the mate is multimapping then you'll get this, but it's not a bug or otherwise odd behaviour.
As an aside, this would indicate that the RNG isn't reseeded with each alignment. This makes sense, since that'd be computationally wasteful.