I have to disagree with kristoffer.vittingseerup on that matter. TPM, like all methods based purely on per-million scaling without further correction, does not correct for changes in library composition. This is already an issue in "normal" RNA-seq, when you simply compare the same cell type of the same species between conditions (e.g. a treatment). It is most likely an even bigger issue when you compare species, as between species you may have gains or losses of genes, pronounced expression differences, changes in gene length etc., so I do expect a notable shift in composition. TPM does not account for this; it is only good for comparing transcript composition within a single sample.

The thing is that the common inter-sample normalization techniques such as TMM from edgeR and RLE from DESeq2, or transformations such as vst and rlog, all assume that most genes do not change. Between species I find that assumption questionable, both with respect to the biological reality and to the completeness of the reference transcriptomes, which might impose further technical difficulties.
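For concreteness, a minimal sketch of how these inter-sample normalizations are computed in practice (cts and coldata are hypothetical placeholders for a raw gene-by-sample count matrix and its sample annotation):

```r
library(edgeR)
library(DESeq2)

## TMM (edgeR): trimmed-mean-of-M-values normalization factors per sample
y <- DGEList(counts = cts)
y <- calcNormFactors(y, method = "TMM")
y$samples$norm.factors

## RLE / median-of-ratios (DESeq2): size factors per sample
dds <- DESeqDataSetFromMatrix(countData = cts, colData = coldata,
                              design = ~ condition)
dds <- estimateSizeFactors(dds)
sizeFactors(dds)
```

Both estimate a per-sample scaling factor from the bulk of the genes, which is exactly where the "most genes do not change" assumption enters.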
I am surprised that these naive per-million methods are still in use and recommended even by more experienced folks, as there is plenty of literature out there recommending against them. Some brief examples (there are many more benchmarking papers on this):
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6171491/
As shown in Table 2, in many comparisons, Total Count and RPKM/FPKM perform worse than all other methods, and several authors expressly recommend against its use [9].
https://academic.oup.com/bib/article/14/6/671/189645
(...) TC and RPKM do not improve over the raw counts
If you search Biostars and the web for opinions on RPKM/FPKM/TPM for meaningful between-sample differential analysis, you will see that the statistics community strongly argues against them. The Bioconductor support site is full of threads on this, and the maintainers of the established packages typically recommend against it as well.
As for your question:
=> I would definitely survey the literature for dedicated approaches that tackle all these points, e.g. SCBN (https://bioconductor.org/packages/release/bioc/html/SCBN.html), rather than trying naive/ad-hoc methods, as these might give you skewed results.
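For illustration only, a rough sketch of how such a dedicated method would be called. The argument names below are my reading of the SCBN manual and should be checked against the current documentation; orth and hk_idx are hypothetical placeholders:

```r
library(SCBN)

## orth:   hypothetical table of orthologous genes between the two species
##         (gene length and read count for each species)
## hk_idx: row indices of genes assumed to be conserved housekeeping genes
scbn_factor <- SCBN(orth_gene = orth, hkind = hk_idx, a = 0.05)
scbn_factor  # scaling factor to make the two species' libraries comparable
```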
I understand that most statisticians argue against TPM etc., but the thing is, what other metric would one use? There are cases where one cannot rely on a downstream algorithm to account for other variables, so until we get an alternative metric - an actual solution - TPM might have to do as the best available estimate.
I am looking at a middle ground right now: load the raw counts into DESeq2 with the design formula, let it calculate size factors etc., then export counts(..., normalized = TRUE). If this can be turned into TPM, we would account for library size, batch effects as well as gene length.
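Roughly along these lines (just a sketch; cts, coldata and the covariate names are placeholders):

```r
library(DESeq2)

## cts: raw count matrix; coldata: sample table with hypothetical
## 'batch' and 'condition' columns
dds <- DESeqDataSetFromMatrix(countData = cts, colData = coldata,
                              design = ~ batch + condition)
dds <- estimateSizeFactors(dds)
norm_counts <- counts(dds, normalized = TRUE)  # size-factor corrected counts
head(norm_counts)
```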
I do not see how this would compensate for batch effects. DESeq2 by default does not correct for that. Of course you can try, and you even should, but be sure to make exploratory MA-plots to check whether the assumption that most genes do not change actually holds.

When is the batch effect accounted for in DESeq2? I thought that since the design formula (which includes batch as a covariate) is specified when the dataset is imported, all operations would take the batch covariate into account. Is that assumption mistaken?
You can include batch in your design formula in DESeq, but that's just another variable to DESeq. It doesn't treat it specially to adjust for batch effects. The normalization, I believe, is independent of the design formula you give. It's only in the calculation of DEG p-values that it uses the formula to adjust for batch effect, or any other variables.
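If in doubt, this is easy to check yourself: with the same hypothetical cts/coldata, the size factors (and therefore counts(dds, normalized = TRUE)) come out identical whether or not batch is in the design.

```r
library(DESeq2)

## Same data, only the design formula differs
dds_a <- DESeqDataSetFromMatrix(cts, coldata, design = ~ condition)
dds_b <- DESeqDataSetFromMatrix(cts, coldata, design = ~ batch + condition)
all.equal(sizeFactors(estimateSizeFactors(dds_a)),
          sizeFactors(estimateSizeFactors(dds_b)))  # expected: TRUE
```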
Thanks for the interesting discussion. Since my only goal is to look at the clustering of the samples, I'm wondering if quantile normalization and PCA, or even some kind of hierarchical clustering using Spearman correlation, would avoid many of these pitfalls, since in those cases the relative differences between genes within a sample are what would determine the clustering, rather than the actual differences in expression values between samples.
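Concretely, something along these lines (a rough sketch; expr stands for a hypothetical genes-by-samples expression matrix):

```r
## Sample clustering on Spearman correlation, which is rank-based and
## therefore only uses the within-sample ordering of genes
cor_mat <- cor(expr, method = "spearman")   # sample-by-sample correlations
hc <- hclust(as.dist(1 - cor_mat), method = "average")
plot(hc)

## PCA on log-transformed values, samples as rows
pca <- prcomp(t(log2(expr + 1)))
plot(pca$x[, 1], pca$x[, 2], xlab = "PC1", ylab = "PC2")
```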
I'm sorry about hijacking your post. Clustering should be an easier topic to tackle than unifying datasets in a comparable manner.
I completely agree, and I did try to highlight the need for an inter-library normalization - now highlighted more. It seems SCBN requires knowledge of housekeeping genes (i.e. genes assumed not to change) - is that not a larger assumption than assuming that most genes do not change (or that the amount of up- and down-regulation is balanced)?