Hi,
I would like to perform unsupervised hierarchical clustering on some RNA-seq data, but I was told I need to normalize the data by z-score per gene.
My question is: what type of RNA-seq data should z-score normalization be performed on? Is it better to normalize RPKM, CPM, log2 CPM, etc.?
I typically represent my RNA-seq data as mean-centered log2 CPM. Can I perform z-score normalization per gene on mean-centered log2 CPM, or is this not advised?
Thanks!
@acorella Can you define the z-score? Is it the z-score normalisation in which a given set of values, e.g. a vector of expression values, is centered to have mean 0 and scaled to have standard deviation 1? After checking, I came across this post, which I believe answers your question: "TCGA: What are mRNA expression z-scores? Does TCGA have mRNA expression from controls?"
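To make that definition concrete, here is a minimal sketch of a per-gene z-score in plain Python. The function name `zscore_per_gene`, the gene names, and the log2 CPM values are all made up for illustration. It assumes the sample standard deviation (n - 1 denominator), and the choice to return zeros for a gene with no variance is also an assumption, not an established convention.

```python
from statistics import mean, stdev

def zscore_per_gene(expr):
    """Z-score each gene across samples.

    expr maps a gene name to its list of expression values, one per sample.
    """
    scaled = {}
    for gene, values in expr.items():
        mu = mean(values)
        sd = stdev(values)  # sample standard deviation (n - 1 denominator)
        if sd == 0:
            # Assumption: a gene with constant expression is set to all zeros
            scaled[gene] = [0.0 for _ in values]
        else:
            scaled[gene] = [(v - mu) / sd for v in values]
    return scaled

# Hypothetical log2 CPM values: two genes measured in four samples
log2_cpm = {
    "geneA": [1.0, 2.0, 3.0, 4.0],
    "geneB": [5.0, 5.5, 9.0, 10.5],
}
z = zscore_per_gene(log2_cpm)
```

After this, each gene's values in `z` have mean 0 and standard deviation 1 across samples, regardless of how strongly the gene was expressed to begin with.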
Hi, yes, that is the z-score I am referring to; however, that post does not answer my question.
My question is: is it appropriate to z-score normalize mean-centered log2 CPM values? Does it matter which type of RNA-seq values (CPM, RPKM, TPM, etc.) I z-score normalize?
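As a quick check on the mean-centering part of this question, the sketch below z-scores one gene before and after mean-centering and compares the results. The values are hypothetical, and the helper `zscore` is just an illustrative name. Because a z-score already subtracts the mean, the two results should agree up to floating-point error. This covers centering only; it does not settle which expression unit (CPM, RPKM, TPM, log-transformed or not) to start from.

```python
from statistics import mean, stdev

def zscore(values):
    """Center to mean 0 and scale to sample standard deviation 1."""
    mu = mean(values)
    sd = stdev(values)
    return [(v - mu) / sd for v in values]

# Hypothetical log2 CPM values for a single gene across four samples
gene = [2.0, 3.5, 5.0, 7.5]

# Mean-centered version of the same gene
gene_mean = mean(gene)
centered = [v - gene_mean for v in gene]

z_raw = zscore(gene)
z_centered = zscore(centered)

# True if both inputs give the same z-scores (within rounding)
same = all(abs(a - b) < 1e-9 for a, b in zip(z_raw, z_centered))
```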
What is the structure of your data, e.g. #samples, #conditions, #replicates/condition? And are you attempting to find clusters of samples? Or genes? Depending on what you are interested in, using z-scores may not be necessary.