Hi again,
Since I didn't find any useful results out there on the interweb I did a little investigation of my own.
I got my hands on 5 million 250bp Miseq reads pairs generated from an HL60 cell line. That amounts to 10 million 250bp reads.
I ran two aligners:
bwasw from bwa 0.6.2
bowtie2 release beta7
Both were used with default parameters (except for telling bwasw to report only 1 alignment per read) to align the reads to hg19.
aligner bwasw bowtie2
version 0.6.2 2 beta 7
Bam_path_reference hg19 hg19
#Reads 10M 10M
%ChastityFailed 6.63 6.63
%duplicates 0.021 0.229
%aligned 97.393 92.811
%paired 92.79 96.817
%uniqueAligned 96.24 98.268
mean_insert_size 1005 349
coverage_estimate 0.797583 0.758483
Some things of note:
- BWASW did have a lower fraction of the reads that aligned aligning in proper pairs. Also, of the "proper pairs" that were identified some were very far apart, which is what caused the mean insert size to spike like it did. The expected insert size is on the order of 350, so bowtie2 is much closer to the target.
- BWASW had the unusual trait of changing the base qualities in some reads. I observed this in many reads where the qualities of mismatched bases in the alignment had higher base qualities than were reported in the fastq file. My understanding is that an aligner should not alter the qualities of the bases in the SAM report.
- I have to look deeper to figure out how bowtie2 manage to find such an increase in dups. I suspect it has something to do with how the ends of the alignments are trimmed in the two aligners, but I need to confirm that.
-BWASW aligned more reads overall
Although it doesn't make a whole lot of sense given the low coverage data, I tried running some snp calling on the alignments for comparison. I used mpileup from samtools 0.1.16 on each bam, followed by vcfutils, and throwing out variants with qualities less that 20.
aligner number snps called concordant re. dbsnp132
bowtie2 355438 0.8172986569
bwasw 389817 0.794103387
So, it looks like the bowtie2 alignments can identify snp calls that are more specific with respect to dbsnp132. However, the number of true positive snps identified was higher in bwa (not reported directly above).
Moving forward I'll need to get my hands on some deeper data that would help with the snp calling test.
Any other thoughts are appreciated.
Bowtie2 requires full-length match by default. If you use "--local", bowtie2 will align more reads. At my hand for 100bp reads, bowtie2 gives more SNP calls but slightly lower ts/tv - the contrary of your last table. I do not know if this is caused by not using "--local". Overall, the two mappers are comparable. I admit bwa-sw has some bugs in the SAM output. I am writing a new algorithm that is supposed to be faster and more accurate than these two.
Yuck. How does one post tables in this forum?