Hi,
I have an RNA-seq count table in TPM (Transcripts Per Million), and I want to perform a hierarchical clustering analysis.
My first intuition was to apply a log2(TPM+1) transformation and scale the data (subtract the mean and divide by the standard deviation) before computing Euclidean distances and performing complete-linkage clustering.
However, since all values are in the same unit (gene expression in TPM), there is no need to scale in order to reconcile different scales or units. I usually scale, even for heatmaps, because it gives a nicely balanced visualization that highlights the samples in which each gene is more or less expressed. Here the aim is different: I want to see whether replicates cluster together.
Therefore, I think that computing Euclidean distances on the log2(TPM+1)-transformed matrix would be the best approach (somewhat similar to the edgeR vignette, section 2.16 "Clustering, heatmaps etc", which suggests using logCPM values). However, I am not 100% sure whether there is a statistical reason to apply only this transformation or whether I should scale too.
Any advice on which of the following is the proper input for hierarchical clustering?
the raw TPM matrix only;
the log2(TPM+1)-transformed matrix;
the scaled log2(TPM+1)-transformed matrix.
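To make the three options concrete, here is a minimal sketch that builds each version of a matrix and computes Euclidean distances between samples. The toy TPM values, the sample names, and the helper names (log2_plus1, zscore_rows, sample_distances) are all made up for illustration, and "scale" is assumed to mean a per-gene Z-score; in R the same steps would normally be done with log2(), scale() and dist().

```python
import math

# Hypothetical toy TPM matrix: rows are genes, columns are samples.
# A1/A2 and B1/B2 are replicates of two made-up conditions.
samples = ["A1", "A2", "B1", "B2"]
tpm = [
    [5000.0, 5600.0, 5200.0, 5900.0],  # highly expressed, not condition-specific
    [10.0, 12.0, 40.0, 44.0],          # moderately expressed, higher in B
    [8.0, 7.0, 1.0, 0.0],              # lowly expressed, higher in A
]

def log2_plus1(matrix):
    """Apply log2(x + 1) to every value."""
    return [[math.log2(v + 1.0) for v in row] for row in matrix]

def zscore_rows(matrix):
    """Centre each gene (row) to mean 0 and scale it to sample SD 1."""
    scaled = []
    for row in matrix:
        mean = sum(row) / len(row)
        sd = math.sqrt(sum((v - mean) ** 2 for v in row) / (len(row) - 1))
        # Genes with zero variance carry no information; set them to 0.
        scaled.append([(v - mean) / sd if sd > 0 else 0.0 for v in row])
    return scaled

def sample_distances(matrix):
    """Euclidean distance between every pair of samples (columns)."""
    cols = list(zip(*matrix))
    return [[math.dist(a, b) for b in cols] for a in cols]

log_tpm = log2_plus1(tpm)
options = {
    "raw TPM": tpm,
    "log2(TPM+1)": log_tpm,
    "scaled log2(TPM+1)": zscore_rows(log_tpm),
}
distances = {name: sample_distances(m) for name, m in options.items()}

for name, dist in distances.items():
    print(name)
    for label, row in zip(samples, dist):
        print(" ", label, [round(d, 2) for d in row])
```

On this toy matrix the raw-TPM distances are driven almost entirely by the single highly expressed gene, so A1 ends up closer to B1 than to its own replicate A2; after the log2(TPM+1) transformation, with or without scaling, each sample is closest to its replicate. This only illustrates the mechanics and is not evidence for which option suits real data.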
Thank you for any help or advice. I know there are similar posts on Biostars, but I did not find any that asks this particular question. If there is one and you could point me to it, I would be glad.
António
I vote for the third option, because read counts (even on the log2 scale) vary greatly between genes regardless of their biological "importance", so transforming to Z-scores will compensate for this. I would go for a more sophisticated normalization though, either calcNormFactors followed by cpm() in edgeR, or the DESeq2 implementations of vst or fpkm, all of which correct for both library size and composition (and some of them, like fpkm, additionally for gene length). I believe edgeR also has an rpkm function, which uses the TMM size factors.
Thank you both for your prompt answers.
So, I'll use scale on the log2(TPM+1) transformed counts.
I understand your point regarding normalization, but I still need to stick with the TPM matrix, which I think is not as good as the alternatives you mentioned but still accounts for library size and gene length.
António
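For the final check of whether replicates cluster together, a minimal sketch of complete-linkage clustering on a sample distance matrix could look like the following. The function name complete_linkage, the sample names, and the distance values are hypothetical and chosen only for illustration; in R this step is usually done by passing the output of dist() to hclust(), whose default method is complete linkage.

```python
def complete_linkage(labels, dist):
    """Naive agglomerative clustering with complete linkage.

    labels: sample names; dist: symmetric matrix of pairwise distances.
    Returns the merges in order as (cluster_a, cluster_b, height) tuples,
    where each cluster is a tuple of sample names.
    """
    clusters = [(i,) for i in range(len(labels))]
    merges = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Complete linkage: the distance between two clusters is the
                # largest distance between any pair of their members.
                height = max(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or height < best[0]:
                    best = (height, a, b)
        height, a, b = best
        merges.append((
            tuple(labels[i] for i in clusters[a]),
            tuple(labels[i] for i in clusters[b]),
            height,
        ))
        merged = clusters[a] + clusters[b]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append(merged)
    return merges

# Hypothetical Euclidean distances between four samples (two replicate pairs).
samples = ["A1", "A2", "B1", "B2"]
dist = [
    [0.0, 1.5, 2.3, 3.6],
    [1.5, 0.0, 2.2, 3.0],
    [2.3, 2.2, 0.0, 1.8],
    [3.6, 3.0, 1.8, 0.0],
]

merges = complete_linkage(samples, dist)
for left, right, height in merges:
    print(left, "+", right, "at height", height)
```

If replicates cluster together, the earliest merges should join samples from the same condition, and the different conditions should only be joined at the last, highest merge.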