I have unstranded RNA-seq data for 30 samples. I mapped each sample with STAR and quantified the resulting BAM files with rsem-calculate-expression from RSEM.
The *.genes.results files show that the same gene_id has a different length and effective_length in each sample.
For example, here are the values for one gene in two samples:
Sample_id  gene_id          length   effective_length
A          ENSG00000000003  3347.53  3141.32
B          ENSG00000000003  3096.99  2884.81
I assume rsem-calculate-expression computes TPM from these lengths. Is it reasonable to merge TPMs across the 30 samples when the same gene was quantified with a different length in each sample? I would like to run hierarchical clustering and similar analyses on the merged TPMs. I would appreciate your guidance.
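Yes. This is why you use RSEM instead of calculating something yourself. It is smart enough to account for the abundances of transcripts of different lengths: the gene-level length and effective_length it reports are averages over the gene's isoforms, weighted by each isoform's estimated abundance in that sample, so they change between samples whenever isoform usage changes.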
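To make that concrete, here is a minimal Python sketch of an abundance-weighted gene length. It is not RSEM's code. The function name gene_level_length, the isoform lengths, and the TPM values are all made up for illustration, and the fallback for a gene with zero abundance is an assumption of this sketch.

```python
def gene_level_length(isoforms):
    """Abundance-weighted mean length for one gene in one sample.

    isoforms: list of (length, tpm) tuples, one per isoform.
    Both the name and the input layout are hypothetical.
    """
    total_tpm = sum(tpm for _, tpm in isoforms)
    if total_tpm == 0:
        # Assumption for this sketch: with no expression there are no
        # weights, so fall back to the plain mean of isoform lengths.
        return sum(length for length, _ in isoforms) / len(isoforms)
    # Weight each isoform's length by its share of the gene's TPM.
    return sum(length * tpm for length, tpm in isoforms) / total_tpm


# Same gene, same two isoforms, different isoform usage (made-up numbers).
sample_a = [(4000.0, 30.0), (2000.0, 10.0)]  # long isoform dominates
sample_b = [(4000.0, 10.0), (2000.0, 30.0)]  # short isoform dominates

print(gene_level_length(sample_a))  # 3500.0
print(gene_level_length(sample_b))  # 2500.0
```

The annotation is identical in both samples, yet the reported gene length differs because the isoform mix differs. That is the same pattern as samples A and B in the question, and it does not make the gene-level TPMs incomparable.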
Thanks to you, I now know why I should use RSEM.