Question

z score transformation by population or by gene?

6.2 years ago

Pietro ▴ 240

In calculating z-scores for microarray or RNA-Seq data, I have found two main answers on how to obtain them.

For example, in R, having a log2 expression matrix x with genes in rows and samples in columns, I would do:

zscore <- function(x) {
  # z-score using the mean and SD of all values in x
  z <- (x - mean(x)) / sd(x)
  return(z)
}

But many people suggest using the base R scale() function on the transposed matrix, like this:

mat_zscore <- t(scale(t(x)))

If I am not wrong, the two approaches are different. In the first one I subtract the global (population) mean and divide by the global SD, computed over every value in the matrix. The second one uses scale(), which operates by column by default, so the matrix is transposed to calculate the mean and SD for each gene (row) separately.
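To make the difference concrete, here is a minimal sketch on a toy matrix. The values, gene names and sample names are made up for illustration; only mean(), sd(), scale() and t() from base R are used.

```r
# Toy log2 expression matrix: 3 genes (rows) x 4 samples (columns); values are made up
x <- matrix(c( 1,  2,  3,  4,
              10, 12, 14, 16,
               5,  5,  6,  8),
            nrow = 3, byrow = TRUE,
            dimnames = list(paste0("gene", 1:3), paste0("sample", 1:4)))

# Global z-score: one mean and one SD computed over every value in the matrix
z_global <- (x - mean(x)) / sd(x)

# Per-gene z-score: scale() works on columns, so transpose, scale, transpose back
z_gene <- t(scale(t(x)))

# After per-gene scaling, every row has mean 0 and SD 1
round(rowMeans(z_gene), 10)
apply(z_gene, 1, sd)

# After global scaling, row means still reflect each gene's overall expression level
rowMeans(z_global)

# The two results are not the same matrix
isTRUE(all.equal(z_global, z_gene, check.attributes = FALSE))
```

With global scaling a highly expressed gene stays high in every sample, whereas per-gene scaling removes the gene's overall level and keeps only its variation across samples.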

My question is, is one of the two more correct than the other? And why are both given as valid alternatives?

Thanks

z score RNA-Seq microarray transformation • 9.2k views

ADD COMMENT • link updated 6.2 years ago by Kevin Blighe 89k • written 6.2 years ago by Pietro ▴ 240

score 1 · Answer 1 · 2019-05-29

6.2 years ago

Kevin Blighe 89k

They should give the same values. Here is my proof, taking the scaling functions from pheatmap() and heatmap.2() and comparing them to scale(): see the thread "cannot replicate the pheatmap scale function".

Keep in mind that we usually scale either by row or by column. Your function scales by the global mean and global standard deviation. In a typical setting for a transcriptomics study, with genes in rows, t(scale(t(x))) will scale by row, i.e., per gene.
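As a rough check of the row-wise case, a hand-written row scaler can be compared against the transposed scale() call. The helper name scale_rows_manual and the simulated data are assumptions for illustration, not code taken from pheatmap or heatmap.2.

```r
# Hypothetical helper (name is illustrative): z-score each row with its own mean and SD
scale_rows_manual <- function(x) {
  m <- rowMeans(x)
  s <- apply(x, 1, sd)
  # m and s hold one value per row; R recycles them down each column,
  # so element [i, j] is paired with m[i] and s[i]
  (x - m) / s
}

set.seed(1)  # arbitrary seed, only for reproducibility
x <- matrix(rnorm(20, mean = 8, sd = 2), nrow = 5)  # 5 simulated genes x 4 samples

# Compare with the scale()-based approach, ignoring the extra attributes scale() adds
all.equal(scale_rows_manual(x), t(scale(t(x))), check.attributes = FALSE)
```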

Kevin

ADD COMMENT • link 6.2 years ago by Kevin Blighe 89k

My question was more like: "Is it better to scale by global or by gene mean and SD?"

ADD REPLY • link 6.2 years ago by Pietro ▴ 240

Can you show an example where global mean and global sdev were used?

ADD REPLY • link 6.2 years ago by Kevin Blighe 89k

This thread, for example: "How transform FPKM values to Z-score using R"

Or you mean an article?

ADD REPLY • link 6.2 years ago by Pietro ▴ 240

Both answers in that thread are old, and the answers by Seán and dariober are different, as you have also highlighted in your question.

The scale() function always scales by column, so each column in the data is scaled separately; you can get it to scale by row by doing t(scale(t(x))). Scaling each row or column separately may be more favourable in certain situations, e.g., for visualisation. However, I have never seen a comprehensive review of why one would be more favourable than the other. You may receive a better answer by posting on Cross Validated.
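A small sketch of the column-wise default, using a made-up matrix: scale() stores the values it subtracted and divided by in the "scaled:center" and "scaled:scale" attributes, which makes it easy to confirm which dimension was used.

```r
# Made-up matrix: 4 genes (rows) x 3 samples (columns), filled column by column
x <- matrix(c(2, 4, 6, 8,
              1, 1, 2, 4,
              9, 7, 5, 3),
            nrow = 4,
            dimnames = list(paste0("gene", 1:4), paste0("sample", 1:3)))

by_col <- scale(x)        # default: centre and scale each column (sample)
by_row <- t(scale(t(x)))  # transpose trick: centre and scale each row (gene)

# scale() records what it subtracted and divided by as attributes
attr(by_col, "scaled:center")  # one value per column, equal to colMeans(x)
attr(by_col, "scaled:scale")   # one value per column, equal to apply(x, 2, sd)

colMeans(by_col)  # approximately 0 for every sample
rowMeans(by_row)  # approximately 0 for every gene
```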

ADD REPLY • link 6.2 years ago by Kevin Blighe 89k

You may be interested in this thread: https://stats.stackexchange.com/questions/201961/do-i-apply-normalization-per-entire-dataset-per-input-vector-or-per-feature

ADD REPLY • link 6.2 years ago by Kevin Blighe 89k