RSEM uses different gene lengths for each sample.
1
1
Entering edit mode
15 months ago
Apprentice ▴ 170

I have 30 samples worth of unstranded RNA-seq data. I mapped with STAR on each of these samples and quantified with rsem-calculate-expression in RSEM based on the resulting bam files.

The **.genes.results file for each sample showed that the same gene_id had different lengths and effective_lengths.

For example, the following two samples are shown below.


Sample_id gene_id length effective_length

A ENSG00000000003 3347.53 3141.32

B ENSG00000000003 3096.99 2884.81


I think that the rsem-calculate-expression of RSEM is calculated based on the above length, but is it reasonable to merge TPMs obtained by calculating different gene lengths for same gene through the 30 samples? I would like to implement hierarchical clustering, etc. using merged TPMs. I would appreciate your guidance.

STAR RNA-seq RSEM • 1.3k views
ADD COMMENT
4
Entering edit mode
15 months ago

RSEM gives the effective length, rather than the absolute length. It is expected that genes will have slightly different effective lengths in different samples.

There is really no such thing as a gene. What exists in reality is a collection of related and overlapping transcripts someone has decided constitutes a gene. The balance of which transcripts are expressed from a gene differs from sample to sample. RSEM quantitates the expression of transcripts, not genes.

Lets say that a gene has two isoforms, one is 100bp and the other is 500bp. If in Sample A, only the short isofrom is expressed, where as in sample B the long isofrom is expressed, then the length of the gene is different in the two samples, and it wouldn't be fair to divide the read counts by 500 in both samples.

This is an extreme example, RSEM quantitates both transcripts in both samples, and then calculates a weighted mean transcript length for each sample. This is length_sample_A = (counts_isoform_1*length_isoform_1 + counts_isoform_2*length_isoform_2)/(counts_isoform_1 + counts_isoform_2), and uses this to calculate the TPM.

You asked if tis valid to use these TPMs in heatmaps etc... There is not a straight forward answer to this, but basically, if you want to use gene expression, then this is a fair as it is possible to be.

ADD COMMENT
0
Entering edit mode

The **.genes.results file for each sample showed that the same gene_id had different lengths and effective_lengths.

Yes. This why you use RSEM, instead of calculating something yourself. It is smart enough to account for the abundances of transcripts of different lengths.

ADD REPLY
0
Entering edit mode

Thanks to you, I now know why I should use RSEM.

ADD REPLY

Login before adding your answer.

Traffic: 1920 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6