When I use BWA(version 0.7.15-r1140), I select the BWA-MEM algorithm and here is my command:
bwa mem -t 2 reference_file fastq1.fq.gz fastq2.fq.gz > result1.sam
This works well and then I use this command to improve the threads:
bwa mem -t 10 reference_file fastq1.fq.gz fastq2.fq.gz > result2.sam
This also works well but when I compare the result1.sam with result2.sam, they are different! I also test with -t 6, -t 16
, and all the results are different. However, when I run with the same threads twice, the results are identical. So I found that BWA-MEM will get different results with different threads.
Then I read the source code and found this:
kt_for(opt->n_threads, worker1, &w, (opt->flag&MEM_F_PE)? n>>1 : n); // find mapping positions
for (i = 0; i < opt->n_threads; ++i) smem_aux_destroy(w.aux[i]);
free(w.aux);
if (opt->flag&MEM_F_PE) { // infer insert sizes if not provided
if (pes0) memcpy(pes, pes0, 4 * sizeof(mem_pestat_t));
else mem_pestat(opt, bns->l_pac, n, w.regs, pes);
}
kt_for(opt->n_threads, worker2, &w, (opt->flag&MEM_F_PE)? n>>1 : n);
That is, BWA-MEM use n_threads(such as -t 6, n = 6) to find mapping positions, but use only 1 thread to execute the function mem_pestat
to calculate avg
(average of the insert size) and std
(standard deviation of the insert size), which are important to find pair information. According to BWA, every thread will process around 10000000bp, so:
If I use -t 2, BWA will calculate avg and std with 2 x 10000000bp
If I use -t 10, BWA will calculate avg and std with 10 x 10000000bp
If I use -t 16, BWA will calculate avg and std with 16 x 10000000bp
So I know why the results are different with different threads.
I wonder to know if there is anything wrong with my opinion? If it's correct, I want to know how to evaluate the difference? The difference will change which filed of the SAM record( such as RNAME or POS)? If it's wrong, I want to know the real reason to make the difference.
Any reply will be much appreciated!
Could you explain how the files are different? How did you compare them? Mapping rate, insert size, order of reads, md5sum? This question as asked before, see here, here and here.
Thanks for your comment. I only use
diff
command to compare. Because I read the source code directly so I found that the code above maybe the reason contributed to the difference. According to my search,samtools flagset
is a good tool to compare and I will use it and tell you the result later. If there is any better compare tool, please let me know.diff is not a good tool for this, unless you sort both files. If reads are in different order, but have exactly same mapping, diff will show them as different. Use Devon Ryan tool, or use mapping statistics. Have in mind that even if the files mapped with different thread number are not identical, differences should be minor.
I found that when the MAPQ is big, the main difference is flag. When the MAPQ is small, the difference gets more complicated...