Hello, I've noticed that bwa mem outputs results to be inconsistent under following setting. I'd have expected this to be not the case, and would appreciate if someone can comment on why this is happening. I have tested this with both bwa mem v0.7.5 and 0.7.15.
Steps to reproduce:
1. Create multiple FASTQ paired-end files from a reference FASTQ set of files. So, if there are 10000 read in our reference FASTQ files (s1.fastq, and s2.fastq), I generate another set of fastq files which also have the same 10000 reads but in a different order. I used the following script to do so.
#!/usr/bin/env python
import Bio
import Bio.SeqIO
import sys
import random
file1 = sys.argv[1]
file2 = sys.argv[2]
record1 = []
record2 = []
for s in Bio.SeqIO.parse(file1,"fastq"):
record1.append(s)
for s in Bio.SeqIO.parse(file2,"fastq"):
record2.append(s)
# Lets make sure both files have same number records
assert(len(record1) == len(record2))
rand_idx = random.sample(xrange(len(record1)), len(record1))
r1 = open("reorder1.fastq", "w")
r2 = open("reorder2.fastq", "w")
for idx in rand_idx:
Bio.SeqIO.write(record1[idx], r1, "fastq")
Bio.SeqIO.write(record2[idx], r2, "fastq")
r1.close()
r2.close()
print("Wrote %d records" % len(record1))
This will say output the following files:
$ ls
reorder1.fastq reorder2.fastq s1.fastq s2.fastq shuffle_fastq.py
The original files were s1.fastq, s2.fastq. The newly generated shuffled files are reorder1.fastq, reorder2.fastq. I made sure reorder1.fastq has all the entries in s1.fastq (and the same for reorder2.fastq/s2.fastq) using this:
#!/usr/bin/env python
import Bio
import Bio.SeqIO
import sys
file1 = sys.argv[1]
file2 = sys.argv[2]
results = []
records = {}
for s in Bio.SeqIO.parse(file1,"fastq"):
records[s.id] = s
for s in Bio.SeqIO.parse(file2,"fastq"):
if not records[s.id].seq == s.seq:
results.append(s)
print "Unmatched records", len(results)
2. Run BWA mem on the output FASTQ file pairs r1.fastq/r2.fastq and s1.fastq / s2.fastq
$ bwa mem -R "@RG\tID:foo\tLB:bar\tPL:illumina\tPU:illumina\tSM:ERR000589" /share/BWA_Index/hg19.all_chr.fa s1.fastq s2.fastq > s.sam
$ bwa mem -R "@RG\tID:foo\tLB:bar\tPL:illumina\tPU:illumina\tSM:ERR000589" /share/BWA_Index/hg19.all_chr.fa reorder1.fastq reorder2.fastq > reorder.sam
# Sort sam files so we can compare them with diff
$ cat reorder.sam | samtools view -Sb - | samtools sort - reorder_sort && samtools view reorder_sort.bam > reorder_sort.sam
$ cat s.sam | samtools view -Sb - | samtools sort - s_sort && samtools view s_sort.bam > s_sort.sam
3. Compare the output sam files s.sam and r.sam, and I see different alignments!!
diff reorder_sort.sam s_sort.sam
Is anyone aware of why this could happen? I've tried this experiment several times with the following results:
Total# FASTQ Reads.... Inconsistent Alignments
2500 135
12500 660
125000 6830
Take a look at: bwa mem multiple alignments doesn't work Output of NGS aligners may be non-deterministic.