In my RNA-seq analysis, I use hisat2 to map my clean RNA-seq reads to the genome. Then, I use FeatureCounts to get the read count matrix of 'gene' in my gff file. Then I perform DE analysis using deseq2.
I noticed that some people use stringtie to assemble the transcripts after hisat2, then perform DE analysis. I feel that:
1. If I skipped transcriptome assembly, what I am actually counting would be the abundance of the PE reads/fragments mapped to the 'gene' in the gff file.
2. If I perform transcriptome assembly, what I actually count would be the abundance of the assembled transcripts derived from the 'gene' in the gff file.
Well, it seems that what I've been doing is wrong... I feel that the abundance of the reads/fragments does not reveal the true level of gene expression since it's also dependent on gene length. For instance, if I have one gene with one exon that expressed once only. Assume the RNA-seq read length is 150bp
Case1: If the gene is 150bp, I would have one read/fragment mapped to it.
Case2: If the gene is 300bp, I would have two mapped to it
Read count would be two in case2 but actually it's only one transcript, just longer than that in the first case. If transcripts are assembled, the gene would be counted as one in both cases.
But I do see a lot of ppl using the same pipeline as I do (Hisat2, FeatureCounts, Deseq2). I wonder if anyone could help clarify this?
If you are working with well-annotated model organism and you are not specially interested in new transcripts then no, not necessary, and I think even not recommended as you probably have a higher chance of getting false-positives than actual new and meaningful new transcripts.
Hi. But like I mentioned, If assembly is not done, the count number I get would be the abundance of reads mapped to the gene feature right? Then that would be length dependent since longer transcripts would simply have more reads regardless of expression level.