Interpretation of "standardized expression matrix"?
4.5 years ago
n,n ▴ 370

I've seen this term being used in some papers working with gene expression data. I assume it refers to performing z-score normalization on the expression matrix, but I would like to know if this is the right interpretation. Also, is this typically done over each gene vector (the rows of a traditional expression matrix) or over the samples (the columns)? Another question I have is whether it is always done one way or whether it depends on the downstream analysis that we want to perform. For example, I've been encountering the term in co-expression papers, which sometimes also refer to this as "zero centering the expression matrix". What about PCA? I think that in R the function prcomp by default performs the normalization on the columns, but could you in some situations do it over the rows before PCA?

RNA-Seq normalization • 1.5k views
4.5 years ago

Generally, yes, it can be regarded as meaning Z-scaled. The Z distribution is also referred to as the 'Standard Normal' distribution, and it proves quite a useful transformation in various parts of biological data analysis because the numbers from the distribution are so readily interpreted:

[Figure: the standard normal distribution]

[source: https://mathbitsnotebook.com/Algebra2/Statistics/STstandardNormalDistribution.html]

So, if we process some single-cell RNA-seq data and eventually transform it to the Z-scale, a gene with Z > 3 in a particular group of cells is more than 3 standard deviations above the mean expression of that gene across all cells, which would conventionally be regarded as statistically significant. For reference, a 5% alpha (p = 0.05) on a two-tailed distribution corresponds to an absolute Z of 1.96.
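As a minimal sketch of the per-gene Z-scaling described above, assuming a hypothetical toy matrix `expr` (made-up values) with genes as rows and cells as columns:

```r
# Toy expression matrix (made-up values): 3 genes (rows) x 5 cells (columns)
expr <- matrix(
  c(1, 2, 3, 4, 10,
    5, 5, 6, 7, 7,
    0, 1, 0, 1, 8),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("gene", 1:3), paste0("cell", 1:5))
)

# scale() operates on columns, so transpose, scale, and transpose back
# to Z-scale each gene across all cells
z <- t(scale(t(expr)))

# Each gene now has mean 0 and standard deviation 1 across cells
gene_means <- rowMeans(z)
gene_sds <- apply(z, 1, sd)

# Two-tailed 5% cutoff on the standard normal distribution (about 1.96)
z_cutoff <- qnorm(0.975)
```

Each value in `z` then reads as "number of standard deviations from that gene's mean", which is what makes thresholds such as |Z| > 1.96 easy to interpret.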

prcomp(), by default, centers the data column-wise to have mean 0; it does this by simply subtracting the mean of each column from all values in that column. However, prcomp() does not, by default, scale the data by dividing by the standard deviation, which is what would ultimately bring it to Z-scores. This can be activated by setting:

prcomp(x, center = TRUE, scale. = TRUE)

prcomp() just uses the scale() function 'under the hood', so you could look up that function. It's used a lot for heatmaps; take a look at my proof in this answer: "cannot replicate the pheatmap scale function"
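A small sketch of that relationship, using randomly generated data as a stand-in (the matrix `x` and the object names are made up for illustration): scaling by hand with scale() and then switching off prcomp()'s own centering and scaling should give the same result as asking prcomp() to do both.

```r
set.seed(1)
# Hypothetical data: 10 observations (rows) x 4 variables (columns)
x <- matrix(rnorm(40), nrow = 10)

# Default behaviour: columns are centered but not scaled
p_default <- prcomp(x)

# Ask for unit-variance scaling as well (the argument is named `scale.`)
p_scaled <- prcomp(x, center = TRUE, scale. = TRUE)

# Scale by hand with scale(), then disable prcomp()'s own centering/scaling
p_manual <- prcomp(scale(x), center = FALSE, scale. = FALSE)

# The component standard deviations agree between the two routes
same_sdev <- isTRUE(all.equal(p_scaled$sdev, p_manual$sdev))
```

Inspecting `p_default$center` and `p_default$scale` shows what was applied: the column means in the first case and `FALSE` in the second.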

Kevin


Thank you once again for your answers, Kevin. What still confuses me is that PCA does the transform column-wise, while the scRNA-seq example you mentioned would do it row-wise to transform each gene across all cells. Is transforming column-wise the same as transforming row-wise? Intuitively I would say no, but I understood from your answer that it is the same.


It would not be the same to scale row-wise or column-wise. However, note that when we use prcomp(), we virtually always supply the rotated (transposed) input data, with samples as rows and genes as columns, so that it is ultimately the genes that are scaled.
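To illustrate both points, here is a minimal sketch on a randomly generated, hypothetical matrix `expr` with genes as rows and samples as columns:

```r
set.seed(42)
# Hypothetical expression matrix: 6 genes (rows) x 4 samples (columns)
expr <- matrix(rnorm(24, mean = 5), nrow = 6,
               dimnames = list(paste0("gene", 1:6), paste0("sample", 1:4)))

by_sample <- scale(expr)        # column-wise: each sample gets mean 0, sd 1
by_gene   <- t(scale(t(expr)))  # row-wise: each gene gets mean 0, sd 1

# Compare the two results numerically (attributes ignored)
identical_scaling <- isTRUE(all.equal(by_sample, by_gene, check.attributes = FALSE))

# Transposed input: samples become rows and genes become columns,
# so prcomp() centers and scales each gene
pca <- prcomp(t(expr), center = TRUE, scale. = TRUE)
```

Here `identical_scaling` comes out `FALSE`, and `pca$center` holds the per-gene means, which confirms that the genes are what get standardized when the transposed matrix is supplied.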


But if, for some analysis, you wanted to do PCA with the samples as the features, would it be OK to do the z-score transformation row-wise (genes) and then again over the columns (samples) right before PCA? For example, some correction techniques have been tested for co-expression analysis in which you do PCA like this and then regress gene expression on the sample loadings as the independent terms; you proceed to the co-expression calculation afterwards. I've seen this in papers, but it is not explained in detail whether genes are standardized and PCA is then performed with additional scaling over the columns, or whether it is performed without scaling.


There is no right or wrong, and, technically, one does not have to standardise anything prior to performing PCA. Methods sections in published works are almost always lacking in detail, too.
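For what it is worth, the numerical effect of standardizing genes first and samples second can be checked directly. The sketch below uses a randomly generated, hypothetical matrix and is only an illustration of what happens to the values, not a recommendation for either approach:

```r
set.seed(7)
# Hypothetical expression matrix: 6 genes (rows) x 5 samples (columns)
expr <- matrix(rnorm(30, mean = 5, sd = 2), nrow = 6)

# Step 1: standardize each gene (row-wise)
by_gene <- t(scale(t(expr)))

# Step 2: standardize each sample (column-wise) on top of that; this is
# the transformation prcomp(by_gene, scale. = TRUE) would apply internally
both <- scale(by_gene)

# The columns (samples) are now exactly standardized ...
col_sds <- apply(both, 2, sd)
# ... but the gene-wise standard deviations are generally no longer 1
row_sds <- apply(both, 1, sd)
```

In other words, the second (column-wise) step partly undoes the first, so the two choices are not interchangeable and it matters which one a paper actually used.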


Maybe I'm getting confused and scaling for PCA is just something done as part of the procedure and standardizing for expression matrices is something unrelated...
