We recently analyzed a set of RNA-seq data with TopHat and ran into some questions about the alignment.
We infected cells with a virus and extracted total RNA for RNA-seq at different time points. Libraries were prepared with a TruSeq Stranded Total RNA sample prep kit with Ribo-Zero, and sequencing was 2 × 50 bp paired-end. There is no problem when we analyze the host transcriptome. However, when we analyzed the viral transcriptome by mapping the reads to the virus genome using TopHat version 2.1.1 with default parameters, we found viral reads mapped to both the positive and negative strands at the same region. A little background on the virus we used: it has a double-stranded circular DNA genome, and both strands are transcribed, but in different regions. We tried the default parameters with all three library types, and the results were the same. So it seems that the R1 and R2 reads are simply mapped to opposite strands of the virus genome without the R1 strand information being converted. Based on this, I tried analyzing only the R1 reads or only the R2 reads, which gave more reasonable results, as expected.
Has anybody encountered this problem, or could anyone help me solve it?
Hello WZOUXM,
You should know that the old 'Tuxedo' pipeline of TopHat(2) and Cufflinks is no longer the "advisable" tool for RNA-seq analysis. The software is deprecated or in low-maintenance mode and should be replaced by HISAT2, StringTie and Ballgown. See this paper: Transcript-level expression analysis of RNA-seq experiments with HISAT, StringTie and Ballgown. There are also other alternatives, including alignment with STAR and BBMap, or pseudo-alignment using Salmon.
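On the strand question itself: in a paired-end alignment the two mates of a pair normally land on opposite genome strands, so seeing R1 and R2 on different strands at the same locus is expected and does not by itself mean the strand information is lost. What matters is combining the mate number with the alignment orientation. Below is a minimal sketch of that logic in Python, assuming a dUTP-based library (the fr-firststrand convention, which I believe applies to TruSeq Stranded kits, but please check against your protocol). The function name `transcript_strand` is made up for illustration; the flag bits are the standard SAM FLAG bits.

```python
# SAM FLAG bits, as defined in the SAM specification
FLAG_REVERSE = 0x10  # read is aligned to the reverse strand
FLAG_READ1 = 0x40    # read is the first mate of the pair (R1)
FLAG_READ2 = 0x80    # read is the second mate of the pair (R2)


def transcript_strand(flag: int) -> str:
    """Infer the strand of the originating transcript from a SAM FLAG.

    Assumption: dUTP / fr-firststrand library, where R1 is antisense
    to the transcript and R2 is sense. For an fr-secondstrand library
    the two returned signs would be swapped.
    """
    is_reverse = bool(flag & FLAG_REVERSE)
    if flag & FLAG_READ1:
        # R1 is antisense: R1 on the reverse strand means a '+' transcript
        return "+" if is_reverse else "-"
    if flag & FLAG_READ2:
        # R2 is sense: R2 on the forward strand means a '+' transcript
        return "-" if is_reverse else "+"
    raise ValueError("flag has neither the R1 nor the R2 bit set")


# The four FLAG values typical of properly paired reads
for flag in (83, 163, 99, 147):
    print(flag, transcript_strand(flag))
```

Under this assumption, flags 83 and 163 belong to one pair orientation and both point to a plus-strand transcript, while 99 and 147 both point to a minus-strand transcript. Splitting the BAM file along these lines, rather than viewing R1 and R2 together by alignment strand, should give per-strand coverage that agrees between the two mates.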