Hi all,
I have raw counts from bulk RNA-seq data (Ensembl annotation) and I am trying to calculate TPM. I understand mathematically how to do it.
What I don’t understand is how to calculate the lengths of transcripts here. I found a package that actually helps to extract lengths in R: https://www.rdocumentation.org/packages/GenomicFeatures/versions/1.24.4/topics/transcriptLengths
They mention that:
The length of a processed transcript is just the sum of the lengths of its exons. This should not be confounded with the length of the stretch of DNA transcribed into RNA (a.k.a. transcription unit), which can be obtained with width(transcripts(txdb)).
When I apply that method I get duplicated gene IDs, because there is one row per transcript and a gene can have several transcripts. How should I solve that issue?
Should I sum the tx_len values (tx_len: the length of the processed transcript) for each gene and then collapse the gene IDs?
Is there a simple way to approach this issue?
All the best
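On the duplicated gene IDs: summing tx_len over all transcripts of a gene would count exons shared between transcripts several times. One common alternative is to take the union of all exons of a gene and sum the merged, non-overlapping intervals. Below is a minimal sketch of that logic in plain Python, only to show the idea; the function name gene_lengths and the coordinates are made up for illustration and are not part of GenomicFeatures. In R, the idiom usually suggested for the same thing is sum(width(reduce(exonsBy(txdb, by="gene")))).

```python
from collections import defaultdict


def gene_lengths(exons):
    """Return {gene_id: length of the union of its exons}.

    exons: iterable of (gene_id, start, end) with 1-based, inclusive
    coordinates (hypothetical input format, chosen for this sketch).
    """
    by_gene = defaultdict(list)
    for gene_id, start, end in exons:
        by_gene[gene_id].append((start, end))

    lengths = {}
    for gene_id, intervals in by_gene.items():
        intervals.sort()
        total = 0
        cur_start, cur_end = intervals[0]
        for start, end in intervals[1:]:
            if start <= cur_end + 1:
                # Overlapping or directly adjacent exon: extend the current block
                cur_end = max(cur_end, end)
            else:
                # Gap (intron): close the current block and start a new one
                total += cur_end - cur_start + 1
                cur_start, cur_end = start, end
        total += cur_end - cur_start + 1
        lengths[gene_id] = total
    return lengths


# Made-up example: both transcripts of geneA contain the exon 100-199
exons = [
    ("geneA", 100, 199),  # transcript 1
    ("geneA", 300, 399),  # transcript 1
    ("geneA", 100, 199),  # transcript 2
    ("geneA", 350, 449),  # transcript 2
    ("geneB", 10, 59),
]
lengths = gene_lengths(exons)
print(lengths)
```

Here geneA gets 250 bp (100-199 plus the merged 300-449), whereas summing the two transcript lengths would give 400 bp. Whether the exon union, the longest transcript, or an average transcript length is the right "gene length" depends on the downstream tool, so treat this as one option rather than the answer.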
May I ask upfront what you plan to use the TPM values for? TPM was actually developed to compare transcript expression within the same sample.
Sure. Some deconvolution methods require data that has not been log-transformed, and they suggest TPM for that.
Best
Please have a look at this post
Update: the error was resolved after starting a new R session.
Hi,
Thank you for the prompt reply; it seems that this could solve the issue. However, I'm getting an error in this line:
Did you encounter this?
Best
Do the raw counts correspond to gene expression estimates or transcript expression estimates?
Hi h.mon,
The raw counts are reads mapped to genes; they are integers because they have not been normalized.
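For gene-level integer counts like these, the TPM arithmetic itself is short once there is one length per gene. A minimal sketch in plain Python, assuming one length in base pairs per gene ID and at least one non-zero count; the function name tpm and the numbers are invented for illustration:

```python
def tpm(counts, lengths_bp):
    """Convert raw counts to TPM.

    counts: {gene_id: raw read count}
    lengths_bp: {gene_id: gene length in base pairs}
    Assumes every gene in counts has a length and the counts are not all zero.
    """
    # Step 1: reads per kilobase (count divided by length in kb)
    rpk = {g: counts[g] / (lengths_bp[g] / 1000) for g in counts}
    # Step 2: per-sample scaling factor so that the values sum to one million
    scale = sum(rpk.values()) / 1e6
    return {g: value / scale for g, value in rpk.items()}


# Made-up counts and lengths for a single sample
counts = {"geneA": 500, "geneB": 100, "geneC": 0}
lengths_bp = {"geneA": 2500, "geneB": 500, "geneC": 1000}
result = tpm(counts, lengths_bp)
print(result)
```

geneA and geneB end up with the same TPM here even though their raw counts differ fivefold, because their lengths differ by the same factor. The calculation is done per sample, so each sample's values sum to one million.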