We can use edgeR to normalize raw read counts with the TMM method, but how can we normalize TPM values with the same method?
I don't quite follow your question. TMM is a method for normalizing the library sizes rather than a method for normalizing read counts. As the edgeR User's Guide says (page 15):
normalization in edgeR is model-based, and the original read counts are not themselves transformed.
Which way around is your question? Do you have TPMs and want to compute TMM factors or do you have TMM factors and want to compute TPMs?
If you are asking the first question, then no, TMM factors can only be computed from the raw counts, not from quantities such as TPMs or CPMs from which the library sizes have already been divided out. If you already have TPMs from some software package, then normalization has almost certainly already been applied, so I would be very wary about trying to re-normalize them unless you really know what you're doing.
If you are asking the second question then, yes, TMM factors can in principle be used to compute TPMs. In edgeR, any downstream quantity that is computed from the library sizes will incorporate the TMM factors automatically, because the factors are considered part of the effective library sizes. TMM normalization factors will be applied automatically when you use
CPM <- cpm(dge)
or
RPKM <- rpkm(dge)
in edgeR to compute CPMs or RPKMs from a DGEList object. I don't necessarily recommend TPM values myself, but if you go on to compute TPMs by
TPM <- t( t(RPKM) / colSums(RPKM) ) * 1e6
then the TMM factors will naturally have been incorporated into the computation.
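The way normalization factors enter CPMs and RPKMs through the effective library sizes can be sketched in base R. This is only an illustration of the arithmetic, not edgeR's own code: the counts, gene lengths and the two factors in `norm_factors` are made-up values standing in for real data and for factors that `calcNormFactors` would estimate.

```r
# Toy data: 3 genes x 2 samples (all values are made up for illustration).
counts <- matrix(c(100, 300, 600,
                   200, 200, 1600),
                 nrow = 3,
                 dimnames = list(c("g1", "g2", "g3"), c("s1", "s2")))
gene_length <- c(g1 = 1000, g2 = 2000, g3 = 4000)  # bases, hypothetical

lib_size     <- colSums(counts)         # raw library sizes
norm_factors <- c(s1 = 1.25, s2 = 0.8)  # hypothetical stand-ins for TMM factors

# Effective library size = raw library size x normalization factor.
eff_lib_size <- lib_size * norm_factors

# CPM: counts per million of effective library size.
CPM <- t(t(counts) / eff_lib_size) * 1e6

# RPKM: CPM per kilobase of gene length (rows are divided by their own length).
RPKM <- CPM / (gene_length / 1000)
```

With a DGEList, `cpm(dge)` and `rpkm(dge)` do the corresponding computation by default, taking the factors from the `norm.factors` column of `dge$samples`.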
Dear Gordon Smyth - would this be the correct way to calculate TMM-normalized logTPM incorporating edgeR's use of prior.count (assuming it's better than doing log2(TPM + 1))? By the way, I don't intend to use this for DGE, but for analyses where having both inter- and intra-sample normalization performed is important.
For those who want to test this, you can verify the above fpkm code is correct with your data:
Hi, thanks for this information. Could you please help me understand the first step?
I assume counts has the gene_id values as row names, with one column per individual, but what about the gene lengths in the genes argument?
Use the non-overlapping exon length of each gene, which is easily calculated with a couple of lines of code using the GenomicFeatures package.
Extract Total Non-Overlapping Exon Length Per Gene With Bioconductor
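To make "non-overlapping exon length" concrete, here is a minimal base R sketch of the idea: merge overlapping exons within each gene and sum the widths of the merged blocks. It is not the GenomicFeatures code itself; `union_length` and the `exons` table are hypothetical names with made-up coordinates. With a TxDb object, the same quantity is usually obtained from `exonsBy(txdb, by = "gene")` followed by `reduce()`, `width()` and `sum()`.

```r
# Total length covered by a set of exons, counting overlapping bases once.
# start/end are 1-based inclusive coordinates on a single chromosome.
union_length <- function(start, end) {
  o <- order(start)
  start <- start[o]
  end <- end[o]
  total <- 0
  cur_start <- start[1]
  cur_end <- end[1]
  for (i in seq_along(start)[-1]) {
    if (start[i] <= cur_end + 1) {
      # Overlapping or adjacent exon: extend the current merged block.
      cur_end <- max(cur_end, end[i])
    } else {
      # Gap: close the current block and start a new one.
      total <- total + (cur_end - cur_start + 1)
      cur_start <- start[i]
      cur_end <- end[i]
    }
  }
  total + (cur_end - cur_start + 1)
}

# Hypothetical exon table for two genes (coordinates are made up).
exons <- data.frame(
  gene_id = c("geneA", "geneA", "geneB", "geneB"),
  start   = c(1, 50, 200, 400),
  end     = c(100, 150, 300, 450)
)

# One length per gene, named by gene_id.
gene_length <- sapply(split(exons, exons$gene_id),
                      function(e) union_length(e$start, e$end))
```

Here geneA's two exons overlap and collapse into one block, while geneB's exons are disjoint and their widths are simply added.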
Thanks for the rapid response. I have already extracted gene_length and GC content from bioMart in this format:
but when I try to input as:
dge <- DGEList(counts=unrom_counts, genes=data.frame(r_xtract$length))
then I remove the gc column
So the question is purely technical: what format should the genes argument have to match the counts?
Thanks again! :)
You should make sure the rows of your count matrix (gene rows x sample cols) match the gene lengths vector…
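One way to do that alignment is with `match()`. This is a sketch under assumptions: the `counts` matrix and the `annot` table (with columns `gene_id` and `length`) are hypothetical stand-ins for your own objects, and the `DGEList` call only runs if edgeR is installed.

```r
# Hypothetical count matrix: genes in rows, samples in columns.
counts <- matrix(c(10, 0, 25, 40, 3, 18), nrow = 3,
                 dimnames = list(c("geneB", "geneC", "geneA"), c("s1", "s2")))

# Hypothetical annotation table in a different order, with one extra gene
# that is not present in the count matrix.
annot <- data.frame(gene_id = c("geneA", "geneB", "geneC", "geneD"),
                    length  = c(1500, 2200, 800, 3000))

# Reorder the annotation so that it follows the rows of the count matrix.
idx <- match(rownames(counts), annot$gene_id)
stopifnot(!anyNA(idx))  # every counted gene needs a length

# One row per gene, in exactly the same order as the counts.
genes <- data.frame(Length = annot$length[idx], row.names = rownames(counts))

# With edgeR installed, the aligned table can be passed as the genes argument.
if (requireNamespace("edgeR", quietly = TRUE)) {
  dge <- edgeR::DGEList(counts = counts, genes = genes)
}
```

The lengths can then be supplied explicitly, for example `rpkm(dge, gene.length = dge$genes$Length)`.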
you're a boss! thanks so much!
Would that not be an idea if you wanted to do an inter-library normalization of the TPM values? Kind of like what's happening under the hood of edgeR::cpm(), which allows the CPM values to be normalized according to the TMM normalization factors, just for TPM values?
TMM normalizes library sizes and can only be applied to counts. Any downstream quantities such as CPMs or TPMs that are computed from library sizes will obviously incorporate the TMM normalization, but that is not the same thing as trying to estimate the TMM factors from the CPMs or TPMs.
In my opinion, normalization should be done prior to the computation of TPMs but, even if you wanted to re-normalize TPMs for some reason, one would not input them to the
calcNormFactors
function in edgeR, which is how I interpreted OP's question. Or perhaps I have misinterpreted the question. Perhaps you and the OP are actually asking a simpler question about whether TMM-normalized library sizes can be used to compute TPMs. That is of course true, and I have edited my answer now to say that.
But is the main goal of inter-library normalisation not simply to make sure the expression/abundance distributions from multiple samples are comparable? I agree that it would be a nicer approach to re-calculate the TPM values from the TMM-normalized counts. But if we ignore that possibility for a moment, I'm not sure I see why inter-library normalisation of TPM values is a bad idea. Would it not just produce abundance estimates that are more comparable across conditions? Or does TMM normalisation itself assume something about the data which does not hold true for TPM values?
Just out of curiosity I tried it and it does look like there is some improvement on a dataset I already had open as can be seen here. Naturally this is not a "bad" dataset which needs it - but sometimes it could be needed (as TPM normalization does not account for e.g. different levels of rRNA contamination).
I don't know what you are referring to when you say that you "tried it", or what type of data you have applied it to, or what you mean by "some improvement". I am not going to continue this discussion.
I have tried to answer OP's question as completely as I can. I have nothing to add to what I have already written.