Question

Principal component analysis of SVA's surrogate variables

1


5.6 years ago

RNAseqer ▴ 280

I was wondering if someone could explain to me a property of the surrogate variable values I get from Bioconductor's sva package. I'm very new to the package. Specifically, I'm performing a PCA on the surrogate variable matrix from svobj, and the principal components I'm getting seem weirdly equal.

I performed sva on a dataset and it estimated that there were 4 surrogate variables.

I downloaded the surrogate variable (sv) matrix from svobj:

            sv1          sv2          sv3          sv4
1   0.170511776 -0.026039142  0.155162179 -0.052378086
2  -0.031146292  0.231081859 -0.119285616  0.020441932
3   0.304738317 -0.114059097  0.056133569 -0.008361104
4   0.384487981  0.222407059 -0.001225998 -0.003543087
5  -0.100784593 -0.076275696  0.013916598 -0.087402628
6  -0.091898903 -0.159580076  0.210261199 -0.042860031
7   0.006998733  0.021321322 -0.018007686 -0.009117072
8  -0.042037192  0.161543154  0.111127593 -0.207275659
9   0.113874692 -0.064348147 -0.102071872  -0.14602898
...

I ran principal component analysis on it (samples as rows, surrogate variables as columns).

sv.pca <- prcomp(sv.mat, scale. = TRUE)

And taking a look at my output:

> sv.pca
Standard deviations (1, .., p=4):
[1] 1 1 1 1

Rotation (n x k) = (4 x 4):
            PC1         PC2        PC3       PC4
sv1  0.18920596  0.65779848 -0.6119765 0.3962158
sv2  0.08862912 -0.73074539 -0.4087601 0.5395102
sv3  0.67067428  0.08332028  0.5861903 0.4468049
sv4 -0.71171764  0.16238861  0.3387932 0.5935546

A scree plot of these principal components shows that each one accounts for exactly 25% of the variation. The standard deviations in sv.pca are all 1, and the PCA plots all show a similar buckshot pattern on equal scales.

So why is this? Is this to be expected given that what I'm looking at are surrogate variables? Or is this a product of the way sva scales its data and divides up the workload of accounting for variance?

I know that it's not a smart idea to start thinking "what should happen" and "what should be the case" when doing statistical analysis. But that exact 25% for each principal component has me thinking I must either be doing something very wrong or have a major hole in my understanding of what I'm working with here.

Here is what I'm thinking; perhaps you can correct my misunderstanding, if that's what this is. It seems to me that if surrogate variables are an adjustment for unknown variables skewing the data in one direction or another, it would be more likely than not that one set of unseen variables (those contributing to surrogate variable X) would be more potent, even slightly, than another (those contributing to surrogate variable Y).

For example, imagine a study comparing the expression data of patients with chronic anxiety to controls, where age, gender, and BMI all happen to be the unknown variables accounted for by surrogate variable X when running sva, while dental hygiene, coffee consumption, and literary preference wind up being accounted for by surrogate variable Y. Then I'd think that, while SV X and SV Y together would account for all of the expression heterogeneity in the study, the values recorded for SV X would be significantly different from those for SV Y, and a PCA of those values would probably give different eigenvalues for the principal components. I'm not talking about any necessarily extreme difference or strong pattern; I'd just expect there to be SOME difference. I find it weird that my data would have a perfect balance among all four PCs when I look at the variance between these surrogate variable values. Or have I missed something fundamental here?

I'd be very grateful if someone could shed light on this for me.

sva stdev surrogate variables • 2.8k views

ADD COMMENT • link updated 5.6 years ago by GouthamAtla 12k • written 5.6 years ago by RNAseqer ▴ 280

0


What is your goal in running PCA on surrogate variables?

ADD REPLY • link 5.6 years ago by GouthamAtla 12k

0


To identify principal components responsible for their spread. They are like any other collection of datapoints aren't they? Is there any reason you can't perform dimension reduction on them?

And more than anything I'm really just curious as to why this breakdown of four equal principal components happened. The fact that I have no idea why that might be says I've got a large gap in my understanding and I'm trying to fill it.

ADD REPLY • link 5.6 years ago by RNAseqer ▴ 280
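
A possible explanation for the equal standard deviations, offered as an assumption rather than something confirmed in this thread: if sva returns the surrogate variables as singular vectors of a row-centered residual matrix, their columns are orthonormal and have mean zero. Their correlation matrix is then the identity, and a scaled PCA has nothing to rank. The sketch below uses a simulated matrix sv.sim (a hypothetical stand-in for svobj$sv) built to have exactly that property.

```r
# Hypothetical stand-in for svobj$sv: 4 orthonormal, mean-zero columns.
# This is an assumption about the structure of the SVs, not sva output.
set.seed(1)
n <- 20                                        # number of samples (made up)
x <- matrix(rnorm(n * 4), nrow = n)
x <- scale(x, center = TRUE, scale = FALSE)    # center each column
sv.sim <- qr.Q(qr(x))                          # orthonormal basis, still mean zero
colnames(sv.sim) <- paste0("sv", 1:4)

# Columns are orthonormal, so the cross-product is the identity matrix
print(round(crossprod(sv.sim), 10))

# Mean-zero orthonormal columns are also uncorrelated
print(round(cor(sv.sim), 10))

# Scaled PCA on such a matrix: every component has the same standard deviation
sim.pca <- prcomp(sv.sim, scale. = TRUE)
print(sim.pca$sdev)
```

To check whether this applies to the real data, one could look at round(crossprod(sv.mat), 6) and round(colMeans(sv.mat), 6). If the first is close to the identity and the second is close to zero, four equal components at 25% each would be the expected result rather than a sign of an error.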

score 2 · Accepted Answer · 2019-05-12

2


5.6 years ago

GouthamAtla 12k

You can compute the correlation (cor()) between the PCs of your data and the surrogate variables identified by sva, to find which PCs (and so what percentage of the variation across samples) are captured by the hidden covariates, or vice versa. That means PCs of the data correlated with the surrogate variables, not PCs computed ON the surrogate variables. PCA on surrogate variables is not meaningful, to my knowledge.

ADD COMMENT • link 5.6 years ago by GouthamAtla 12k
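
The suggestion in the answer above could look roughly like the following. This is only a sketch on simulated data: expr, hidden, and sv.mat are made-up placeholders, and in a real analysis expr would be the genes x samples matrix passed to sva() and sv.mat would be svobj$sv.

```r
# Simulated stand-ins (all names and values here are hypothetical)
set.seed(2)
n.samples <- 12
n.genes <- 200
expr <- matrix(rnorm(n.genes * n.samples), nrow = n.genes)

# A made-up hidden covariate that shifts the first 50 genes
hidden <- rnorm(n.samples)
expr[1:50, ] <- expr[1:50, ] + outer(rep(2, 50), hidden)

# Placeholder for svobj$sv: one column tracking the hidden covariate, one noise
sv.mat <- cbind(sv1 = hidden, sv2 = rnorm(n.samples))

# PCA on the expression data (samples as rows), not on the surrogate variables
expr.pca <- prcomp(t(expr), scale. = TRUE)

# Correlate the first few PCs with each surrogate variable
pc.sv.cor <- cor(expr.pca$x[, 1:4], sv.mat)
print(round(pc.sv.cor, 2))

# Proportion of variance explained by each of those PCs, for context
print(round(summary(expr.pca)$importance["Proportion of Variance", 1:4], 3))
```

Rows of pc.sv.cor are PCs and columns are surrogate variables, so a large absolute value in a cell suggests that the PC and the surrogate variable are picking up the same source of variation.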