I am counting reads with featurecounts after aligning RNA-seq data to Hg38 using STAR. When I run featurecounts with “-t exon” (i.e. using the exon flag in the 3rd column of the GTF file), I get generally poor results, with less than 50% being assigned and greater than 50% being unnassigned_NoFeatures. However, when I specify “-t transcript” (i.e. using the transcript flag instead of exon), I get significantly more assigned reads (maybe around 70%) and much fewer unassigned_NoFeatures.
I am essentially wondering if there’s anything wrong with using -t transcript instead of -t exon. Of course, I understand the theoretical biological difference between a transcript and an exon, but is it so awful to use the transcript rows in the GTF instead of the exon rows? In theory, doesn’t using the transcript flag tell you more about transcription than the exon flag does?
In addition, what could be causing such a huge difference? Using the transcript flag instead of the exon flag gets me double or more than double the number of reads assigned per sample (from ~27 million to ~56 million in one sample, for example. Does this simply mean I have a ton of intronic reads in my data? Or is something else driving this?
As some extra info, I already checked rRNA genes. They seem to be annotated as exons and transcripts anyway, so it’s not them. In case anyone was wondering, I have tried running it unstranded, forward, and reverse. Reverse gets the best results.
Ah, yes. I should have mentioned that I’m pretty sure it was indeed total RNA-Seq, not poly-A selected. And I definitely see what you mean about intronic reads potentially making up a large portion of the reds despite not being particularly plentiful in number. I’m still shocked that my number of aligned reads doubles, but the logic makes sense.
So, in your opinion, then, I should continue on using -t exon to only include exonic reads in my analysis? Or would counting be transcripts and including intronic reads more relevant? I’m just not quite sure which will give me the best, biologically relevant information.
It feels relevant to clarify that downstream, I’ll be doing the standard per-gene quantification and differential expression analysis with DESeq2. I won’t be comparing expression of different genes within one sample. So a bloated count due to intronic reads may not be the worst thing in the world? I’m not sure.
And by the way, thank you very much for your answer!
I guess what I'm saying is that doubling your aligned reads when using transcript rather than exon is entirely within the normal range of expectation. I'm not sure how much work has been done on including intron counts in bulk RNA-seq. I think a fiar amount has been done for single cell RNAseq, I don't remember what the conclusion was. One thing I'd worry about is that you sometimes get whole seperate genes (particularly ncRNAs like miRNAs, snoRNAs and scRNAs) nestled in the introns of larger genes. These genes are quite often highly expressed, and so might account for most of the reads assigned to the containing gene, even if they are not coming from this gene.
Got it. Thank you very much for your thoughts!