Hi all,
I aligned my RNA-seq reads against a reference genome using tophat with its default aligner, bowtie2,
and otherwise default parameters:
tophat -p 8 -G $annotation -o out $database L1_1.fq.gz L1_2.fq.gz
After getting the results, I found that some reads in the unmapped.bam file have exactly the same sequence as the reference. The following is one line from the unmapped.sam file:
DGZN8DQ1:360:H9RN8ADXX:1:1101:4791:1895 69 * 0 255 * * 0 0 TTTTGCTTTCTGACTCTGTGCTTGTGCCTTCAAGACTTTCACAACGATTTTCTGCTCCTCAATAAGGAAAGCCCGAGATCGGAAGAGCACACGTCTGAAC CCCFFFFFHHHHHJJJJJJJIJJJHIJJJJJIJJJIJJJJIJJJJJIJJJJJJJJJJJJIJIJJJJJIJJJJJJHHFFDEDDDDDDDDDDDDDDDDDCCD
Does anyone know why bowtie2 doesn't report those reads as mapped? Thanks.
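For reference, the flag field (69) in the record above can be decoded with a few lines of Python. This is only a minimal sketch: the bit meanings are taken from the SAM specification, and `decode_flag` is an illustrative helper name, not a function from any library.

```python
# Bit meanings of the SAM FLAG field, as defined in the SAM specification.
SAM_FLAGS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse_strand",
    0x20: "mate_reverse_strand",
    0x40: "first_in_pair",
    0x80: "second_in_pair",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


def decode_flag(flag):
    """Return the names of all bits set in a SAM FLAG value (hypothetical helper)."""
    return [name for bit, name in SAM_FLAGS.items() if flag & bit]


# Flag of the example record: 69 = 1 + 4 + 64
print(decode_flag(69))  # ['paired', 'unmapped', 'first_in_pair']
```

So the record is the first read of a pair and is itself flagged as unmapped; the mate-unmapped bit (8) is not set.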
Hi Devon, thank you very much. I just tried mapping with bowtie2 directly instead of tophat, and the number of mapped reads increased a little. I also ran BLAST on the unmapped reads, and most of them matched mouse ribosomal RNA.
I didn't change the annotation file, and I made sure there are rRNA entries in the gff file. In this case, the reads should map to the reference, but they didn't.
So my guess is that tophat filters rRNA reads automatically? Do you have any experience with this? Thank you very much.
Perhaps, but it's more likely that the reads map so many times that they're discarded. There are enough copies of rRNA in the genome that this could be the case. I should add that I don't use tophat anymore; it's just too painfully slow. Give STAR a try if you have enough RAM.
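One way to get a feel for how much multi-mapping there is among the reads that did align is to tabulate the NH tag (number of reported alignments for the read, an optional tag in the SAM specification). The following is a rough sketch, assuming the aligner writes NH tags; `multihit_histogram` and the toy records are made up for illustration and are not real data.

```python
from collections import Counter


def multihit_histogram(sam_lines):
    """Count primary, mapped alignment records per NH value (hypothetical helper)."""
    hist = Counter()
    for line in sam_lines:
        if line.startswith("@"):        # skip header lines
            continue
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])
        if flag & 0x4 or flag & 0x100:  # skip unmapped and secondary records
            continue
        for tag in fields[11:]:         # optional tags start at column 12
            if tag.startswith("NH:i:"):
                hist[int(tag[5:])] += 1
                break
    return hist


# Toy records for illustration only (not real data).
toy = [
    "@HD\tVN:1.0",
    "r1\t0\tchr1\t100\t50\t4M\t*\t0\t0\tACGT\tIIII\tNH:i:1",
    "r2\t0\tchr1\t200\t1\t4M\t*\t0\t0\tACGT\tIIII\tNH:i:3",
    "r2\t256\tchr2\t300\t1\t4M\t*\t0\t0\tACGT\tIIII\tNH:i:3",
    "r3\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII",
]
hist = multihit_histogram(toy)
print(sorted(hist.items()))  # [(1, 1), (3, 1)]
```

With real data, the same function could be fed the text output of `samtools view` on the alignment file. Note that this only describes reads that were reported as aligned; it cannot show reads that were dropped entirely.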
Thank you very much. I will try STAR.