Both BWA and Bowtie2 are used to map reads to genomes pretty extensively with 2x150 bp or 2x250 bp Illumina data with insert sizes in the range of 400-1000 bp these days.
The vast majority of datasets are also paired end 2x150 bp datasets, at least in microbiology. Yet, I have found that the default parameters for these programs are not necessarily appropriate for the standard Illumina data coming off of (almost everyone's?) HiSeqs and NovaSeqs for the past few years.
For example, the default -X
(capital X) parameter for bowtie2 is expected insert size and is 500 bp. However, for most modern Illumina datasets the insert sizes can be substantially larger than this a large fraction of the time. Running with the default -X 500
will therefore flag read pairs outside of this insert size as improper (which will be completely ignored by samtools mpileup by default!), and presumably be less likely to accurately assign their location.
This led me down trials and tribulations of trying to figure out exactly how both bowtie2 and bwa decide where to place two paired reads, and ending up concluding that the information largely doesn't exist on the internet in a form understandable by all of the thousands of biologists who run these programs almost exclusively on paired read data.
So, can anyone explain how both programs work at assigning the location of a pair of reads? Some questions that would need to be answered by a proper explanation:
- If read 1 maps equally well to N positions, is it randomly assigned a position?
- If the aligner then finds that read 2 maps perfectly well to a position nearby only one of those N positions, will it go back and adjust the position of read 1 to the position that both reads are best aligned to?
- If read 1 maps perfectly to a position where-as read 2 maps equally well to two positions, will it be placed at the position closer to read 1? How do parameters like the
-X
parameter affect this? - Will the aligner ever choose a (valid) alignment with suboptimal identity for read 1 in order to make it closer to read 2 (and vice versa?)
- Do the mapq scores of reads in a pair influence each other?
Sometimes these sort of questions is best answered by preparing some test cases and see what happens. For example, write a dummy genome of 1000bp with some repeats, write a pair of fastq files with a handful of reads mapping to or near the repeats, index, align and see the results. (In theory, a conclusive answer would come from inspecting the source code, but good luck with it!)