Question

RSEM uses different gene lengths for each sample.

1

Entering edit mode

15 months ago

Apprentice ▴ 170

I have 30 samples worth of unstranded RNA-seq data. I mapped with STAR on each of these samples and quantified with rsem-calculate-expression in RSEM based on the resulting bam files.

The **.genes.results file for each sample showed that the same gene_id had different lengths and effective_lengths.

For example, the following two samples are shown below.

Sample_id gene_id length effective_length

A ENSG00000000003 3347.53 3141.32

B ENSG00000000003 3096.99 2884.81

I think that the rsem-calculate-expression of RSEM is calculated based on the above length, but is it reasonable to merge TPMs obtained by calculating different gene lengths for same gene through the 30 samples? I would like to implement hierarchical clustering, etc. using merged TPMs. I would appreciate your guidance.

STAR RNA-seq RSEM • 1.3k views

ADD COMMENT • link 15 months ago by Apprentice ▴ 170

score 4 · Accepted Answer · 2023-08-07

RSEM gives the effective length, rather than the absolute length. It is expected that genes will have slightly different effective lengths in different samples.

There is really no such thing as a gene. What exists in reality is a collection of related and overlapping transcripts someone has decided constitutes a gene. The balance of which transcripts are expressed from a gene differs from sample to sample. RSEM quantitates the expression of transcripts, not genes.

Lets say that a gene has two isoforms, one is 100bp and the other is 500bp. If in Sample A, only the short isofrom is expressed, where as in sample B the long isofrom is expressed, then the length of the gene is different in the two samples, and it wouldn't be fair to divide the read counts by 500 in both samples.

This is an extreme example, RSEM quantitates both transcripts in both samples, and then calculates a weighted mean transcript length for each sample. This is length_sample_A = (counts_isoform_1*length_isoform_1 + counts_isoform_2*length_isoform_2)/(counts_isoform_1 + counts_isoform_2), and uses this to calculate the TPM.

You asked if tis valid to use these TPMs in heatmaps etc... There is not a straight forward answer to this, but basically, if you want to use gene expression, then this is a fair as it is possible to be.