Hi BioStars,
I have two questions about using TPM (transcripts per million). I've read some papers on the calculation and some blog and forum posts so I have some understanding of what it is. The true analysis for this experiment was with raw counts and vst expression values, and I'm basically just having a look at TPM out of interest.
My questions:
1. Is it valid to calculate TPM from DESeq2's normalised counts, i.e. counts(dds, normalized = TRUE), or do I have to use the raw, raw counts? I tried both and there didn't seem to be a great deal of difference (actually my TPM results aren't that different to using normalized raw counts for the genes I've looked at, in either case) but I haven't tested it thoroughly.
2. I understand why one shouldn't compare TPM between samples, since the total expression rate, rRNA component etc. vary from sample to sample. I'm just wondering if this would be less of a problem in the case where data from three biological replicates were available?
Thanks for reading and have a nice Friday,
Tom
Hi Karl,
Thanks for the reply.
I'm sorry if my first question wasn't clear. I realise TPM is not read count—I manually calculated TPM from normalised read count (and, separately, from raw read count) using the gene lengths from my GTF file. I don't know whether it's valid to use the normalised counts instead of the raw counts in the TPM calculation.
Tom
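For what it's worth, here is a minimal sketch of the manual calculation described above, written in Python rather than R. It assumes that the normalised counts are simply the raw counts divided by one size factor per sample; the toy counts, gene lengths and the size factor are all made up for illustration. Under that assumption the size factor cancels when each sample is rescaled to sum to one million, so raw and normalised counts should give the same TPM.

```python
def tpm(counts, lengths_bp):
    """TPM for one sample: divide counts by length, then rescale to sum to 1e6."""
    rates = [c / l for c, l in zip(counts, lengths_bp)]
    total = sum(rates)
    return [1e6 * r / total for r in rates]


# Hypothetical toy data: three genes in a single sample.
raw_counts = [100, 300, 600]
gene_lengths = [1000, 1500, 3000]

# Hypothetical per-sample size factor (assumption: one number per sample).
size_factor = 1.25
normalised_counts = [c / size_factor for c in raw_counts]

tpm_raw = tpm(raw_counts, gene_lengths)
tpm_norm = tpm(normalised_counts, gene_lengths)

# The size factor appears in both numerator and denominator, so it cancels.
max_diff = max(abs(a - b) for a, b in zip(tpm_raw, tpm_norm))
print(tpm_raw)
print(max_diff)
```

If the normalisation is gene-specific rather than a single factor per sample (for example, a matrix of normalisation factors), the cancellation no longer holds and the two inputs can give different TPM values.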
Hi,
Could you please tell me which is the formula that you use to manually calculate TPM?
I'm getting a little bit confused since I'm trying to find an "unambiguous" one and I found these 3 links, that don't say exactly the same thing.
http://lynchlab.uchicago.edu/publications/Wagner,%20Kin,%20and%20Lynch%20%282012%29.pdf
https://www.biostat.wisc.edu/bmi776/lectures/rnaseq.pdf
I used RSEM to calculate expression, but I need a TPM estimate for a gene that I can't take from the RSEM output (don't ask, it's complicated :) )
In particular, using the formula from the Dewey presentation, (10^6 * Z * (C_i / L'_i * N)), I'm trying to understand what exactly Z stands for. It should be a normalization parameter, so it should be the same for all the transcripts (am I right?), but when I try to extrapolate its value from the TPM values of the RSEM output (basically Z = TPM_value / (10^6 * c_i / L_i * N)), I get different results for Z (the values oscillate a little bit around a constant number).
Hi, the formula I used in R was lifted from here (it's the same as the Wagner paper).
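On the Z question, a small Python sketch may help. It assumes the formula is read as TPM_i = 10^6 * Z * c_i / (L_i * N), which is my interpretation of the grouping and not something the thread confirms. Under that reading, requiring the TPM values to sum to 10^6 implies Z = N / sum_j(c_j / L_j), a single constant per sample. The counts and lengths below are made up.

```python
# Hypothetical toy data for one sample.
counts = [50, 200, 750]
lengths = [500, 2000, 2500]
N = sum(counts)  # total count in the sample

# TPM computed directly: length-normalised rates rescaled to sum to 1e6.
rates = [c / l for c, l in zip(counts, lengths)]
tpm = [1e6 * r / sum(rates) for r in rates]

# Back-calculate Z per transcript using the SAME lengths.
z_values = [t / (1e6 * c / (l * N)) for t, c, l in zip(tpm, counts, lengths)]
z_expected = N / sum(rates)

# Back-calculate Z using slightly DIFFERENT lengths (hypothetical values).
other_lengths = [450, 1900, 2480]
z_mismatch = [t / (1e6 * c / (l * N)) for t, c, l in zip(tpm, counts, other_lengths)]

print(z_values, z_expected)
print(z_mismatch)
```

With matching lengths, Z comes out the same for every transcript. With mismatched lengths it drifts from transcript to transcript. RSEM reports both a length and an effective length per transcript, so one possible cause of the oscillation is back-calculating with a different length column than the one the TPM values were derived from, though that is only a guess.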
Thanks @biola. I need to normalize my htseq-count data to TPM. I read your code, but in my case I have 20000 genes (rows) and 259 samples (columns). How do I apply your TPM function to that matrix?
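One way to handle a genes x samples matrix is to apply the same calculation to each column separately. The sketch below is in plain Python, not the R function referred to above, and the function name tpm_matrix and the toy data are hypothetical. It assumes one length per gene, in the same row order as the count matrix.

```python
def tpm_matrix(count_rows, lengths_bp):
    """Column-wise TPM for a genes x samples matrix given as a list of rows."""
    n_samples = len(count_rows[0])
    # Divide every count by its gene's length.
    rate_rows = [[c / l for c in row] for row, l in zip(count_rows, lengths_bp)]
    # Sum the length-normalised rates within each sample (column).
    col_totals = [sum(row[j] for row in rate_rows) for j in range(n_samples)]
    # Rescale each column so that it sums to one million.
    return [[1e6 * r / col_totals[j] for j, r in enumerate(row)]
            for row in rate_rows]


# Hypothetical toy matrix: 3 genes (rows) x 3 samples (columns).
counts = [
    [10, 0, 30],
    [20, 40, 30],
    [70, 60, 40],
]
lengths = [1000, 2000, 500]

result = tpm_matrix(counts, lengths)
col_sums = [sum(row[j] for row in result) for j in range(3)]
print(col_sums)
```

Each column of the result should sum to one million. A sample whose counts are all zero would cause a division by zero here, and the summary rows that htseq-count appends (such as __no_feature and __ambiguous) would need to be dropped before the calculation since they have no gene length.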
Sorry, if I have extracted a list of differentially expressed genes with edgeR, does it make sense to use Transcripts Per Million (TPM) normalized data for co-expression analysis? I mean, I first defined the DE genes from raw read counts with edgeR, but since I also had a TPM file, I extracted the edgeR-defined DE genes from the TPM file and used those values for network construction.