RNA-seq z-score normalization prior to clustering
8.3 years ago
acorella ▴ 30

Hi,

I would like to perform unsupervised hierarchical clustering on some RNA-seq data, but I was told I need to normalize the data by z-score per gene.

My question is: what type of RNAseq data should z-score normalization be performed on? Is it better to do the normalization on RPKM, CPM, log2 CPM, etc?

I typically represent my RNAseq data as mean-centered log2 CPM: Can I perform z-score normalization per gene on mean-centered log2 CPM? Or is this not advised?

Thanks!

RNA-Seq normalization • 14k views

@acorella Can you define the z-score? Do you mean the z-score normalisation in which each vector of expression values is centered to have mean 0 and scaled to have standard deviation 1? After checking, I came across this post, which I believe answers your question: "TCGA: What are mRNA expression z-scores? Does TCGA have mRNA expression from controls?"


Hi, yes, that is the z-score I am referring to; however, that post does not answer my question.

My question is, is it appropriate to z-score normalize mean-centered log2 CPM values? Does it matter what type of RNAseq values (CPM, RPKM, TPM, etc) I use to z-score normalize?
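
One part of this can be checked directly. A per-gene z-score already subtracts the gene's mean, so if the mean-centering was done per gene, z-scoring the centered values gives the same result as z-scoring the uncentered log2 CPM. Below is a minimal sketch of that check, assuming per-gene centering and the sample standard deviation; the expression values and the `zscore` helper are made up for illustration.

```python
import math
import statistics

def zscore(values):
    """Return (x - mean) / sample standard deviation for each value."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample SD (n - 1 denominator), an assumption
    return [(x - mean) / sd for x in values]

# Hypothetical log2 CPM values for one gene across five samples.
log2_cpm = [3.1, 4.7, 2.9, 5.2, 4.0]

# Mean-center the gene first, as described in the question.
gene_mean = statistics.mean(log2_cpm)
centered = [x - gene_mean for x in log2_cpm]

z_from_raw = zscore(log2_cpm)
z_from_centered = zscore(centered)

# The two results agree up to floating-point error.
same = all(
    math.isclose(a, b, abs_tol=1e-9)
    for a, b in zip(z_from_raw, z_from_centered)
)
print(same)
```

This does not hold if the centering was done per sample rather than per gene, and it says nothing about whether CPM, RPKM or TPM is the better starting unit.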


What is the structure of your data, e.g. #samples, #conditions, #replicates/condition? And are you attempting to find clusters of samples? Or genes? Depending on what you are interested in, using z-scores may not be necessary.

8.3 years ago
thomas.smith2 ▴ 120

Hi Acorella,

The Z-score normalisation only really makes sense if the expression values for a given gene are (approximately) normally distributed. One would expect RPKM to be approximately log-normally distributed and CPM to be approximately negative binomially distributed, assuming CPM = counts per million? If you want to work from RPKM or CPM, I'd suggest using the log RPKMs. I'm not sure what to expect from log CPMs.

However, I think you'd be much better off using transcripts per million (TPM) as your unit of expression (see "Question: the problem with rpkm (and tpm)" and "What the FPKM? A review of RNA-Seq expression units"). The second link also explains how to convert from RPKM/FPKM to TPM. log(TPM) will be approximately normally distributed and suitable for calculating z-scores.
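
As a rough sketch of that workflow: TPM can be obtained from RPKM by rescaling each sample so its values sum to one million, after which the values are log-transformed and z-scored per gene. The RPKM matrix, the function names and the pseudocount of 1 (to avoid taking the log of zero) are all assumptions for illustration, not something taken from the linked posts.

```python
import math
import statistics

def rpkm_to_tpm(rpkm_sample):
    """Rescale one sample's RPKM values so that they sum to one million."""
    total = sum(rpkm_sample)
    return [x / total * 1e6 for x in rpkm_sample]

def zscore(values):
    """Return (x - mean) / sample standard deviation for each value."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    return [(x - mean) / sd for x in values]

# Hypothetical RPKM matrix: rows are samples, columns are genes.
rpkm = [
    [10.0, 200.0, 5.0],
    [12.0, 150.0, 9.0],
    [8.0, 300.0, 4.0],
    [11.0, 180.0, 7.0],
]

# Convert each sample to TPM, then log-transform with an assumed pseudocount.
tpm = [rpkm_to_tpm(sample) for sample in rpkm]
log_tpm = [[math.log2(x + 1.0) for x in sample] for sample in tpm]

# Transpose so each row is one gene, then z-score each gene across samples.
z_by_gene = [zscore(list(gene)) for gene in zip(*log_tpm)]

for gene_index, z in enumerate(z_by_gene):
    print(gene_index, [round(v, 3) for v in z])
```

The base of the logarithm does not affect the z-scores, since changing base only rescales each gene's values by a constant. Genes with zero variance across samples would need to be filtered out first, because the standard deviation is then zero.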

8.3 years ago
Ron ★ 1.2k

Log2 CPM can be used for unsupervised clustering. This should work.
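
For illustration, here is a small pure-Python sketch of that idea: compute log2 CPM from raw counts, then cluster the samples by average linkage on Euclidean distance. The count matrix, the prior count of 0.5 and the helper names are assumptions, and the naive clustering loop is only meant to show the logic; for real data a library routine such as SciPy's `scipy.cluster.hierarchy.linkage` would be the usual choice.

```python
import math

def log2_cpm(counts_sample, prior=0.5):
    """log2 counts per million for one sample; prior avoids log(0) (assumed value)."""
    lib_size = sum(counts_sample)
    return [math.log2((c + prior) / lib_size * 1e6) for c in counts_sample]

def euclidean(a, b):
    """Euclidean distance between two expression profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def average_linkage(profiles):
    """Naive agglomerative clustering; returns merges as (members_a, members_b, distance)."""
    clusters = [[i] for i in range(len(profiles))]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Average distance over all pairs of members from the two clusters.
                pairs = [(a, b) for a in clusters[i] for b in clusters[j]]
                d = sum(euclidean(profiles[a], profiles[b]) for a, b in pairs) / len(pairs)
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges

# Hypothetical raw counts: rows are samples (two per condition), columns are genes.
counts = [
    [100, 500, 20, 5],
    [110, 480, 25, 4],
    [20, 520, 200, 60],
    [25, 490, 210, 55],
]

profiles = [log2_cpm(sample) for sample in counts]
merges = average_linkage(profiles)

for left, right, dist in merges:
    print(left, right, round(dist, 2))
```

With this toy matrix the two replicates of each condition are joined before the two conditions are merged. To cluster on per-gene z-scores instead, the log2 CPM values would be z-scored across samples for each gene before computing distances.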
