I have several RNA-seq raw count datasets that I want to normalize to TPM. I extracted the transcript lengths from Ensembl. Here, for example, are the lengths for one of the datasets; as you can see, the gene names are the rownames:
Now, according to this awesome article, it is better to use effective lengths rather than just the transcript lengths. The effective length is gene length - X + 1, where X is the "mean of the fragment length distribution which was learned from the aligned read".
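To make the definition concrete, here is how I understand the formula written out, together with the usual TPM definition it would feed into. The symbols are my own notation, not the article's: $\ell_i$ is the length of feature $i$, $\mu$ is the mean fragment length (the X above), $\tilde{\ell}_i$ is the effective length, and $c_i$ is the raw count. This is only a sketch of my reading and may not match exactly what the article or any particular tool does.

```latex
\tilde{\ell}_i = \ell_i - \mu + 1
\qquad\text{and}\qquad
\mathrm{TPM}_i = 10^{6} \cdot
  \frac{c_i / \tilde{\ell}_i}{\sum_{j} c_j / \tilde{\ell}_j}
```

Note that under this reading, any feature with $\ell_i \le \mu - 1$ would get a zero or negative effective length, so I assume such cases need special handling.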
My question is how do I calculate that in R?
What exactly do they mean by this X, and is the gene length the same as the transcript length? I'm a bit confused.