When looking into how to calculate z-scores for microarray or RNA-Seq data, I have found two main approaches.
For example, in R, having a log2 expression matrix x with genes in rows and samples in columns, I would do:
zscore <- function(x) {
z <- (x - mean(x)) / sd(x)
return(z)
}
But many people suggest using the base R function scale on the transposed matrix instead, like:
mat_zscore <- t(scale(t(x)))
If I am not wrong, the two approaches are different: in the first one I am subtracting the mean and dividing by the SD of all values in the matrix (a global mean and SD), while the second one relies on scale, which operates by column by default, so the transposing is done to calculate the mean and SD for each gene (row).
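To make the difference concrete, here is a minimal sketch on a small made-up matrix (the values and the names z_global and z_gene are only for illustration, not taken from any real dataset). It assumes that the zscore function above is applied to the whole matrix at once, which is what produces the global version.

```r
# Toy log2 expression matrix: 3 genes (rows) x 4 samples (columns).
# The values are invented purely for illustration.
x <- matrix(c( 1,  2,  3,  4,
              10, 10, 12, 12,
               5,  7,  5,  7),
            nrow = 3, byrow = TRUE,
            dimnames = list(paste0("gene", 1:3), paste0("s", 1:4)))

# Global z-score: one mean and one SD computed over every value in the matrix
z_global <- (x - mean(x)) / sd(x)

# Per-gene z-score: each row is centred and scaled by its own mean and SD
z_gene <- t(scale(t(x)))

# Row means differ between genes after global scaling,
# because each gene keeps its overall expression level
round(rowMeans(z_global), 2)

# Row means are all zero after per-gene scaling
round(rowMeans(z_gene), 2)
```

With the global version, a highly expressed gene stays high in every sample; with the per-gene version, only the variation of each gene across samples remains.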
My question is, is one of the two more correct than the other? And why are both given as valid alternatives?
Thanks
My question was more like: "Is it better to scale by global or by gene mean and SD?"
Can you show an example where global mean and global sdev were used?
How transform FPKM values to Z-score using R
Or you mean an article?
Both answers in that thread are old, and the answers by Seán and dariober are different, as you have also highlighted in your question.
The scale() function will always scale by column only (you can get it to scale by row by doing t(scale(t(x)))); so, each column in the data is scaled separately. This may be more favourable in certain situations, e.g., for visualisation. However, I have never seen a comprehensive review of why one would be more favourable over the other. You may receive a better answer by posting on Cross Validated.

You may be interested in this thread: https://stats.stackexchange.com/questions/201961/do-i-apply-normalization-per-entire-dataset-per-input-vector-or-per-feature
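As a quick check of the column-wise behaviour described above, the following sketch compares scale() with a manual per-row calculation. The matrix is random and the object names (col_scaled, row_scaled, manual) are hypothetical, chosen just for this example.

```r
set.seed(1)
# 5 hypothetical genes (rows) x 4 samples (columns) of simulated log2 values
x <- matrix(rnorm(20, mean = 8, sd = 2), nrow = 5)

col_scaled <- scale(x)        # default: centre and scale each column (sample)
row_scaled <- t(scale(t(x)))  # transpose trick: centre and scale each row (gene)

# Manual per-row calculation for comparison; the length-5 vectors are
# recycled down the columns, so element [i, j] uses the statistics of row i
manual <- (x - rowMeans(x)) / apply(x, 1, sd)

# The transposed scale() call also carries "scaled:center" and "scaled:scale"
# attributes, so attributes are ignored when comparing
all.equal(manual, row_scaled, check.attributes = FALSE)

# Column means are (numerically) zero only in the column-scaled version
round(colMeans(col_scaled), 10)
```

Note that a gene with identical values in every sample has an SD of zero, so its row-wise z-scores come out as NaN; such rows usually need to be filtered out or handled separately before plotting.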