I have a data frame of GSVA scores (positive and negative values) for gene modules that are important for my analyses. I want to normalize the GSVA scores prior to running PCA, computing correlations, and clustering my samples based on those scores, considering that the sign of the scores matters.
I have read multiple threads about the use of clr from the R compositions package. I have also read suggestions about using scale().
I tried both methods and they give very different results, and I cannot seem to find the ideal method for normalizing GSVA scores prior to PCA, correlation, or clustering. Could anyone advise, please?
My GSVA data matrix is: GSVA_subset (which contains positive and negative GSVA scores for modules of interest). My associated metadata matrix is new_meta.
I have tried the following (using clr, then PCAtools):
library("compositions")
library("PCAtools")
clr_mat <- clr(t(GSVA_subset))
clr_mat <- as.data.frame(clr_mat)
p <- pca(clr_mat, metadata = new_meta)
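For what it's worth, clr is only defined for strictly positive compositional data, so any zero or negative GSVA score will break the log-ratio. A minimal check, assuming GSVA_subset is the numeric matrix above:

any(GSVA_subset <= 0)   # TRUE means clr's log() will produce NaN/-Inf here
# for reference, clr of a single strictly positive vector x is just:
# log(x) - mean(log(x))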
Or simply calculating z-scores with scale(), then PCAtools:
# scale() standardizes each column; with samples in columns this
# z-scores per sample, not per module
GSVA_scaled <- scale(GSVA_subset)
p <- pca(GSVA_scaled, metadata = new_meta)
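If the intent is per-module z-scores instead, a sketch under the same modules-in-rows assumption:

# z-score each module (row) across samples, then keep the
# modules-in-rows orientation that pca() expects
GSVA_row_scaled <- t(scale(t(GSVA_subset)))
p <- pca(GSVA_row_scaled, metadata = new_meta)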
Is there a better approach?
Thank you for your help
Thank you for your prompt response and for the reading suggestions.
I found this tutorial on obtaining the StandardScaler equivalent of the base-R function scale() "with a tweak": https://mahout.apache.org/docs/latest/algorithms/preprocessors/StandardScaler.html
According to the tutorial, R's sd() applies the sample (N−1) denominator, whereas StandardScaler divides by N, so you can "use the following form in R to “undo” the degrees of freedom correction". I then applied the suggestion to my dataset:
N <- nrow(GSVA_subset)
# undo R's sample (N-1) denominator: sd_pop = sd_sample * sqrt((N-1)/N)
GSVA_scaled <- scale(GSVA_subset, scale = apply(GSVA_subset, 2, sd) * sqrt((N - 1) / N))
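As a sanity check, the tweak can be verified on a toy matrix (hypothetical data, just for illustration): each column should come out with mean ~0 and population standard deviation ~1, matching StandardScaler's convention.

set.seed(1)
m <- matrix(rnorm(20), nrow = 5)                 # toy 5 x 4 matrix
N <- nrow(m)
sd_pop <- apply(m, 2, sd) * sqrt((N - 1) / N)    # population sd per column
scaled <- scale(m, scale = sd_pop)
colMeans(scaled)                                 # ~ 0
apply(scaled, 2, function(x) sqrt(mean((x - mean(x))^2)))  # ~ 1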