Question

How to calculate the effective gene length in R?


2.1 years ago

JACKY ▴ 160

I have several RNA-seq raw count datasets that I want to normalize to TPM. I extracted the transcript lengths from Ensembl. Here, for example, are the lengths for one of the datasets; as you can see, the gene names are the row names:

[Image: table of transcript lengths, with gene names as row names]

Now, according to this awesome article, it is better to use effective lengths rather than just the transcript lengths. The effective length is the gene length - X + 1, where X is the "mean of the fragment length distribution which was learned from the aligned read".
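Written out, as far as I understand the definition, the effective length and the TPM built on it would look like the following. The symbols are my own notation and not necessarily the article's: $\ell_i$ is the length of transcript $i$, $\mu$ is the mean fragment length (the X above), $\tilde{\ell}_i$ is the effective length, and $c_i$ is the read count.

```latex
\tilde{\ell}_i = \ell_i - \mu + 1,
\qquad
\mathrm{TPM}_i = 10^{6} \cdot \frac{c_i / \tilde{\ell}_i}{\sum_j c_j / \tilde{\ell}_j}
```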

My question is how do I calculate that in R?

What exactly do they mean by this X, and is the gene length the same as the transcript length? I'm a bit confused.

r TPM normalization • 746 views


score 3 · Accepted Answer · 2022-10-18

You don't, at least not starting from a count matrix. If you read the linked blog post carefully, you will see that calculating the effective gene length requires information, such as the fragment length distribution, that is not available from such a matrix. You need a dedicated tool that estimates it from the reads themselves; popular options are salmon, kallisto, and RSEM. That said, you would need to start from scratch with the FASTQ files. With only counts, you cannot do this.
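For what it's worth, once you do have effective lengths (salmon, for example, reports an EffectiveLength column next to TPM in its quant.sf output), the TPM arithmetic itself is short. Below is a minimal sketch in base R with made-up numbers. The function name `counts_to_tpm`, the toy matrix, and the mean fragment length of 200 are all hypothetical and only there to show the calculation; in practice that mean has to be estimated from the reads, which is exactly what a count matrix cannot give you.

```r
# Hypothetical helper: convert a genes x samples count matrix to TPM,
# given one effective length per gene (same order as the rows).
counts_to_tpm <- function(counts, eff_len) {
  stopifnot(nrow(counts) == length(eff_len), all(eff_len > 0))
  rate <- counts / eff_len                  # each row divided by that gene's effective length
  sweep(rate, 2, colSums(rate), "/") * 1e6  # rescale every sample to sum to one million
}

# Toy data (all values invented for illustration).
counts <- matrix(
  c(100, 300, 50,
    200, 150, 75),
  nrow = 3,
  dimnames = list(c("geneA", "geneB", "geneC"), c("sample1", "sample2"))
)
tx_len <- c(geneA = 1000, geneB = 3000, geneC = 500)  # transcript lengths in bp

# ASSUMPTION: in reality this comes from the aligned or pseudo-aligned reads.
mean_frag_len <- 200

# Effective length as defined in the question: length - mean fragment length + 1.
# Transcripts shorter than the mean fragment length would need special handling.
eff_len <- tx_len - mean_frag_len + 1

tpm <- counts_to_tpm(counts, eff_len)
print(round(tpm, 1))
print(colSums(tpm))
```

Each column of the result sums to one million by construction, which is a quick sanity check. If you rerun the quantification with salmon or kallisto, the tximport package on Bioconductor can read their output into R, so you would normally not need to compute this by hand.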