I'd like to run TopHat2 on a 50bp single-end RNA-seq dataset in order to get gene counts for differential expression analysis. I was planning to run it with --GTF ensembl_genes.gtf, since the TopHat2 paper reports that this leads to significant gains in sensitivity and accuracy.
What I'm wondering is: how will overlapping Ensembl transcripts affect the results?
When TopHat generates a FASTA file from my Ensembl gene file, it includes multiple overlapping sequences.
It seems like these would lead to ambiguous alignments, and that I would need to merge overlaps before running TopHat, but I'm not finding any discussion of this on the forum or in the papers, so I wanted to double-check.
Thanks -Ben
That is not entirely true. TopHat may mispredict novel splice junctions by choosing a splicing motif from the wrong strand.
TopHat is designed with the problem of overlapping transcripts, as well as antisense transcripts, in mind, but you are correct that TopHat can and does produce false positive (and false negative) results.