We just tested Bowtie2 for reads mapping multiple times in a genome:
1. We took one read with multiple (perfect) mapping loci in the genome (hg19):
FASTQ entry:
@test_sequence
CAAAGTGTTTACTTTTGGATTTTTGCCAGTCTAACAGGTGAAGCCCTGGAGATTCTTATTAGTGATTTGGGCTGGGGCCTGGCCATGTGTATTTTTTTAAA
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII
UCSC - BLAT:
browser details YourSeq 101 1 101 101 100.0% 2 - 114359280 114359380 101
browser details YourSeq 101 1 101 101 100.0% 15 - 102519435 102519535 101
browser details YourSeq 101 1 101 101 100.0% 1 + 11635 11735 101
2. We created a fastq file containing 1,000 times this entry.
3. We started bowtie2 with the following command:
bowtie2 -x hg19 -U test.fq -S test.sam
4. bowtie2 claims in its manual:
By default, bowtie2 searches for distinct, valid alignments for each read. When it finds a valid alignment, it continues looking for alignments that are nearly as good or better. The best alignment found is reported (randomly selected from among best if tied).
5. BUT:
=> For all reads bowtie2 reports the same location (chr1:11635-11735) in the SAM output
=> Using default parameters, bowtie2 does NOT select the reported mapping randomly
Why is that a problem?
Two examples:
- Assume you analyze a small RNA-seq experiment to quantify microRNAs. Now you have an example like hsa-let-7a. The 5' arm of this microRNA (hsa-let-7a-5p) occurs three times in the human genome, while the 3' arm (hsa-let-7a-3p) is unqiue. When you now use bowtie2 for mapping (without knowing the effect shown above), you might think, that your hsa-let-7a-5p reads are randomly distributed over these three loci, normalizing the "expression" in a way. But that is not the case. Instead one of the three let-7 microRNAs gets all reads and you think it is highly expressed. But is it?
- The same problem occurs when doing transcriptome sequencing. It might happen, that all the reads of a gene having pseudogene copy in the genome, map to the latter, resulting in no expression at all. Is it not expressed?
TIP: You can and should force bowtie2 to report all multiple loci (-k or -a parameter). Do not use the bowtie2 default parameter for NGS read mapping!
that's pretty scary. i assume it's the same if you set the seed?
What do you mean by "setting the seed"?
--seed <int> Use <int> as the seed for pseudo-random number generator. Default: 0. Use any number other than zero - what happens?
we did not try that, since the we wanted to use the default parameters.
We just checked it and when setting --seed, the locus switches to another one, but it is still only one single locus for all 1,000 reads! ;)
When you write code that requires a random number selection you typically must provide a seed value to start the pseudo-random number generation. Running the algorithm with the same seed will produce the same "random" output. The time when the program was run (expressed as an integer) is a typical choice of seed. Apparently Bowtie does not modify the seed value between reads once you start alignments, so the same "randomly selected" locus is chosen in your pathological (but not unreasonable) test data.
I wonder if it actually thinks chr1 position is the 'best' position for some odd reason. Is it an error in the randomly selection process or an error in finding best match.
If he's changing the random seed and getting a different location (see above), then it doesn't sound like one match is scoring better than the others.
scary bias that could throw off a lot of downstream expression analysis specially for genomes with high repeat content.