I have a data frame of GSVA scores (positive and negative values) for gene modules that are important for my analyses. I want to normalize the GSVA scores prior to running PCA, computing correlations, and clustering my samples based on those scores, considering that the sign of the scores matters.
I have read multiple threads about the use of clr from the R compositions package. I have also read suggestions about using scale().
I tried both methods and they give very different results, and I cannot seem to find the ideal method for normalizing GSVA scores prior to PCA, correlation, or clustering. Could anyone advise, please?
My GSVA data matrix is: GSVA_subset (which contains positive and negative GSVA scores for modules of interest). My associated metadata matrix is new_meta.
I have tried the following (using clr, then PCAtools):
library("compositions")
library("PCAtools")
clr_mat <- clr(t(GSVA_subset))
clr_mat <- as.data.frame(clr_mat)
p <- pca(clr_mat, metadata = new_meta)
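For what it's worth, clr is only defined for strictly positive compositional data, so any zero or negative GSVA score will break the log-ratio. A minimal check, assuming GSVA_subset is the numeric matrix above:

any(GSVA_subset <= 0)   # TRUE means clr's log() will produce NaN/-Inf here
# for reference, clr of a single strictly positive vector x is just:
# log(x) - mean(log(x))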
Or simply calculating z-scores with scale(), then PCAtools:
# scale() standardizes each column; with samples in columns this
# z-scores per sample, not per module
GSVA_scaled <- scale(GSVA_subset)
p <- pca(GSVA_scaled, metadata = new_meta)
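If the intent is per-module z-scores instead, a sketch under the same modules-in-rows assumption:

# z-score each module (row) across samples, then keep the
# modules-in-rows orientation that pca() expects
GSVA_row_scaled <- t(scale(t(GSVA_subset)))
p <- pca(GSVA_row_scaled, metadata = new_meta)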
Is there a better approach?
Thank you for your help
Thank you for your prompt response and for the reading suggestions.
I found this tutorial on obtaining the StandardScaler equivalent of the base-R function scale() "with a tweak": https://mahout.apache.org/docs/latest/algorithms/preprocessors/StandardScaler.html
According to the tutorial, R's sd() applies the sample (N−1) denominator, whereas StandardScaler divides by N, so you can "use the following form in R to “undo” the degrees of freedom correction". I then applied the suggestion to my dataset:
N <- nrow(GSVA_subset)
# undo R's sample (N-1) denominator: sd_pop = sd_sample * sqrt((N-1)/N)
GSVA_scaled <- scale(GSVA_subset, scale = apply(GSVA_subset, 2, sd) * sqrt((N - 1) / N))
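As a sanity check, the tweak can be verified on a toy matrix (hypothetical data, just for illustration): each column should come out with mean ~0 and population standard deviation ~1, matching StandardScaler's convention.

set.seed(1)
m <- matrix(rnorm(20), nrow = 5)                 # toy 5 x 4 matrix
N <- nrow(m)
sd_pop <- apply(m, 2, sd) * sqrt((N - 1) / N)    # population sd per column
scaled <- scale(m, scale = sd_pop)
colMeans(scaled)                                 # ~ 0
apply(scaled, 2, function(x) sqrt(mean((x - mean(x))^2)))  # ~ 1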