8.3 years ago
nickp60
How do the different mappers handle reads extending past the end of a reference sequence?
I am unclear as to how BWA, Bowtie2, or <#insert favorite mapper> scores reads that map to the end of a chromosome but also partially extend past the reference. In particular, I am interested in how reads mapped to the 'end' of a bacterial genome are scored (in reality, these reads span the bacterial origin), given that the genome must be represented as linear in the typical .fasta.
See this thread and its top-rated answer: "Circular Genome??". That behavior may still be current, though I have not specifically checked. You may also see this as soft-clipping at the beginning or end of reads.
My concern is that there is a strong possibility (without knowing how the mappers handle such reads) that the bridging reads will be underrepresented or rejected.
For most aligners you should expect a complete drop in coverage as you approach the edge of chromosomes/contigs. You might be able to get bwa mem to do the back splicing if your reads are long enough (it'll do this with supplementary alignments), but in general aligners are tailored toward mammalian chromosomes.
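One common workaround for circular genomes (not necessarily what any particular linked answer describes) is to append the first few hundred bases of the sequence to its end before indexing, so that origin-spanning reads have a contiguous target. Below is a minimal sketch of that idea in plain Python. The function names `extend_circular` and `wrap_position` and the `overlap` parameter are hypothetical, chosen only for illustration.

```python
def extend_circular(seq: str, overlap: int) -> str:
    """Return seq with its first `overlap` bases appended to the end.

    Assumption: `seq` is one circular replicon and `overlap` is at least
    as long as the longest read, so any origin-spanning read fits.
    """
    if overlap < 0 or overlap > len(seq):
        raise ValueError("overlap must be between 0 and len(seq)")
    return seq + seq[:overlap]


def wrap_position(pos: int, genome_length: int) -> int:
    """Map a 1-based position on the extended reference back onto the circle."""
    if pos < 1:
        raise ValueError("pos must be 1-based and positive")
    return (pos - 1) % genome_length + 1


if __name__ == "__main__":
    genome = "ACGTACGGTT"                 # toy 10-base "circular" genome
    extended = extend_circular(genome, 3)
    print(extended)                       # original sequence plus its first 3 bases
    print(wrap_position(11, len(genome))) # position 11 on the extended reference
```

Reads that fall entirely within the duplicated region can align equally well to both copies, so mapping qualities and coverage near the origin would need to be interpreted with that in mind.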
Here is one solution posted on SeqAnswers.
Quick Follow-up: I contacted the BWA package author, and he confirms that these overhangs are treated as clipping. I have not had any luck capturing more reads by changing/removing the clipping penalty, so it would be interesting to hear if anyone has found a way to do that.
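To see how many reads are affected, one could look for alignments whose soft clip sits exactly at a reference boundary. The following is a rough sketch that works on the POS and CIGAR fields of a SAM record plus the reference length taken from the header. The helper names (`parse_cigar`, `reference_span`, `end_overhang`) are made up for this example and are not part of BWA or any other tool.

```python
import re

# One CIGAR operation: a length followed by an operation code (SAM spec).
CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")


def parse_cigar(cigar: str):
    """Split a CIGAR string such as '20S80M' into [(20, 'S'), (80, 'M')]."""
    return [(int(n), op) for n, op in CIGAR_RE.findall(cigar)]


def reference_span(ops) -> int:
    """Number of reference bases covered (M, D, N, = and X consume reference)."""
    return sum(n for n, op in ops if op in "MDN=X")


def end_overhang(pos: int, cigar: str, ref_length: int):
    """Return (left, right) soft-clipped base counts that abut a reference end.

    `pos` is the 1-based leftmost aligned position (SAM POS field).
    A clip is counted only if the aligned part touches position 1
    (left) or position `ref_length` (right).
    """
    ops = [x for x in parse_cigar(cigar) if x[1] != "H"]  # ignore hard clips
    left = right = 0
    if ops and ops[0][1] == "S" and pos == 1:
        left = ops[0][0]
    last_aligned = pos + reference_span(ops) - 1
    if len(ops) > 1 and ops[-1][1] == "S" and last_aligned == ref_length:
        right = ops[-1][0]
    return left, right


if __name__ == "__main__":
    # Toy records against a hypothetical 1000-base reference.
    for pos, cigar in [(1, "20S80M"), (921, "80M20S"), (500, "10S90M")]:
        print(pos, cigar, end_overhang(pos, cigar, 1000))
```

Counting such reads before and after changing the clipping penalty, or between mappers, would give a direct measure of whether overhanging reads are being recovered.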
A cursory run through with SMALT showed noticeably increased recovery of overhanging end reads, but I haven't done a very thorough benchmarking. Thanks all for your input; has anyone else found this limitation of BWA problematic?