How to in R: Include principal components as covariates in analysis
5.6 years ago
RNAseqer ▴ 280

I know that principal components can be included in a model as covariates, but what are the R commands needed to actually do this? I have been looking around on the internet for hours and have yet to turn up an example. Does anyone have one?

Suppose you have a factors file and a datas.pca file, and you want to incorporate the first 2 PCs from the PCA as covariates alongside variables like BMI, Batch, and Age in the factors file. What commands are necessary?

Many thanks for the help!

5.6 years ago

1, create random data

# simulate a 20-gene x 40-sample matrix of expression-like values
randomdata <- matrix(rexp(800, rate=.1), ncol = 40)
colnames(randomdata) <- paste0("sample", 1:ncol(randomdata))
rownames(randomdata) <- paste0("gene", 1:nrow(randomdata))

randomdata[1:5,1:5]
         sample1   sample2   sample3   sample4   sample5
gene1  0.3603136  1.159749 3.0826237  3.309323 0.5730092
gene2 35.0652692 20.814076 5.7487437  3.502387 1.2635026
gene3  8.2158417  6.442191 2.1134863 26.724393 9.2087334
gene4 19.6870141  6.617958 5.7245038 10.698933 9.1174246
gene5  6.8046549  5.262233 0.2679569 25.170229 8.3293813

2, create fake metadata for DiseaseStatus and BMI

metadata <- data.frame(DiseaseStatus = c(rep("case", 20), rep("control", 20)), BMI = sample(10:40, 40, replace = TRUE))

head(metadata)
  DiseaseStatus BMI
1          case  12
2          case  12
3          case  18
4          case  20
5          case  27
6          case  32

3, perform PCA and plot bi-plot for PC1 and PC2

# transpose so that samples are rows - prcomp() expects observations in rows
pca <- prcomp(t(randomdata))

# plot the samples on PC1 vs PC2
plot(pca$x[,1:2])

[bi-plot of PC1 vs PC2]

Also see my answer here for more 'PCA' plots using base R functions: A: PCA plot from read count matrix from RNA-Seq
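
If you are unsure how many PCs to carry forward as covariates, a quick look at the proportion of variance explained by each component can help (a minimal sketch using the pca object from above):

# proportion of variance explained by each PC
pve <- pca$sdev^2 / sum(pca$sdev^2)
round(pve[1:5], 3)

# quick scree plot of the first 10 PCs
barplot(pve[1:10], names.arg = paste0("PC", 1:10), ylab = "Proportion of variance explained")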

4, now prepare an object for regression modeling

# samples as rows: metadata, the first two PC scores, and the expression values
modeling <- data.frame(metadata, pca$x[,1:2], t(randomdata))

modeling[1:5,1:5]
        DiseaseStatus BMI        PC1         PC2     gene1
sample1          case  12 -18.485678   8.3374667 0.3603136
sample2          case  12  -3.393260  -0.9962191 1.1597494
sample3          case  18   6.579711 -10.1392232 3.0826237
sample4          case  20  13.311610  26.2884376 3.3093231
sample5          case  27  16.998644  -3.0140936 0.5730092

# set control as reference level
modeling$DiseaseStatus <- factor(modeling$DiseaseStatus, levels = c('control','case'))

5, binary logistic regression - adjust for BMI, PC1, and PC2

summary(glm(DiseaseStatus ~ gene1 + BMI + PC1 + PC2, data = modeling, family = binomial(link = 'logit')))


Call:
glm(formula = DiseaseStatus ~ gene1 + BMI + PC1 + PC2, family = binomial(link = "logit"), 
    data = modeling)

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-1.43692  -1.12846   0.01018   1.15101   1.33862  

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.266677   1.030321  -0.259    0.796
gene1       -0.009663   0.041650  -0.232    0.817
BMI          0.013031   0.036195   0.360    0.719
PC1         -0.004580   0.018005  -0.254    0.799
PC2         -0.010359   0.020470  -0.506    0.613

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 55.452  on 39  degrees of freedom
Residual deviance: 54.908  on 35  degrees of freedom
AIC: 64.908

Number of Fisher Scoring iterations: 4
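
If you prefer odds ratios to log-odds, you can store the fitted model and exponentiate the coefficients (a minimal sketch; confint() profiles the likelihood, so it may take a moment):

fit <- glm(DiseaseStatus ~ gene1 + BMI + PC1 + PC2, data = modeling, family = binomial(link = 'logit'))

# odds ratios with 95% confidence intervals
exp(cbind(OR = coef(fit), confint(fit)))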

6, linear regression - adjust for BMI, PC1, and PC2

summary(lm(gene1 ~ DiseaseStatus + BMI + PC1 + PC2, data = modeling))

Call:
lm(formula = gene1 ~ DiseaseStatus + BMI + PC1 + PC2, data = modeling)

Residuals:
   Min     1Q Median     3Q    Max 
-8.447 -5.121 -2.338  3.445 31.670 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)  
(Intercept)        8.25736    4.11994   2.004   0.0528 .
DiseaseStatuscase -0.56933    2.60968  -0.218   0.8286  
BMI               -0.04532    0.14733  -0.308   0.7602  
PC1                0.13024    0.06949   1.874   0.0693 .
PC2               -0.11322    0.08107  -1.397   0.1713  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 8.202 on 35 degrees of freedom
Multiple R-squared:  0.1373,    Adjusted R-squared:  0.03869 
F-statistic: 1.392 on 4 and 35 DF,  p-value: 0.2567
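
In a real analysis you would usually fit the same model for every gene rather than just gene1; a minimal sketch that loops over the gene columns of the modeling object and collects the DiseaseStatus coefficient for each:

genes <- grep("^gene", colnames(modeling), value = TRUE)

results <- t(sapply(genes, function(g) {
  fit <- lm(as.formula(paste(g, "~ DiseaseStatus + BMI + PC1 + PC2")), data = modeling)
  summary(fit)$coefficients["DiseaseStatuscase", ]
}))

head(results)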

-------------------------------------

You should have good justification for wanting to adjust for PCs. Usually, this is done in genetic association studies in order to adjust for population stratification.

Don't just adjust for PCs because your supervisor told you to - ask them why. One should not add covariates to a regression model without justification.
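
A quick sanity check before adjusting is to ask whether the top PCs actually track anything that you have measured (batch, BMI, disease status, etc.); a minimal sketch using the modeling object from above:

# is PC1 associated with a continuous covariate such as BMI?
cor.test(modeling$PC1, modeling$BMI)

# is PC1 associated with a categorical variable such as DiseaseStatus?
summary(lm(PC1 ~ DiseaseStatus, data = modeling))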

Kevin


Will do. As always, thank you for your detailed and helpful response!


You are very welcome, RNAseqer


Hello Kevin,

It's been many years since I did anything with principal components, and over the past few months I've done a lot of brushing up on the subject, but I'm having a bit of trouble understanding what some of the values in the covariate analysis above actually are.

As I understand it, PCA works like this: given a bunch of data-points of different values, there will be variance within the data. If you plot your data-points in 2 dimensions you can see the variance in the X dimension and the variance in the Y dimension. Plotting in 3 dimensions you can see variance along the x, y, and z axes, and in fact there can be variance within the data along many more dimensions than the three we can actually visualise.

Stated crudely, principal component analysis works like this: a principal component is an axis or line that can be drawn through the cloud of data-points in any one of those dimensions. If we imagine 2-dimensional data, the line might be horizontal, or one that bisects the cloud of data-points at a 45-degree angle, or a 10-degree angle, or whatever. But there will be ONE line or axis that, with the data-points projected onto it, captures the maximum amount of variance in the data. If the 2-dimensional cloud of data-points is shaped like a hot dog bun, then the axis that runs along the middle of the bun lengthwise, lying inside it like a hot dog, will be the axis that explains the most variation and thus be the PRINCIPAL component. An axis drawn orthogonal to this one, along the bun shape's width, will explain the second-largest amount of variance and be the second principal component. Since it is 2-dimensional data, there will only be 2 principal components.

This all largely makes sense to me, as does the math we use to determine this. What I'm struggling to understand is the value in the table below for PC1, sample 1:

modeling <- data.frame(metadata, pca$x[,1:2], t(randomdata))

modeling[1:5,1:5]
        DiseaseStatus BMI        PC1         PC2     gene1
sample1          case  12 -18.485678   8.3374667 0.3603136
sample2          case  12  -3.393260  -0.9962191 1.1597494
sample3          case  18   6.579711 -10.1392232 3.0826237
sample4          case  20  13.311610  26.2884376 3.3093231
sample5          case  27  16.998644  -3.0140936 0.5730092

-18.485678, for example - what is that? It was my understanding that when you take your expression data and do PCA, you're getting the axes or lines that explain the largest amount of variance for ALL the data. So why aren't the principal component values the same for all samples in this table? After all, the data from all samples was contained within the cloud that we divided up when finding our principal components, wasn't it? So why isn't the value representing PC1 the same for all samples? Is it some measure of sample 1's contribution to principal component 1? Is it the variance of all of sample 1's data-points (genes) projected onto the line of PC1? Is it a distinct principal component that comes from doing PCA on the sample 1 data-points alone? If so, how does THAT fit with the overall principal component determined from all data-points across all samples? Or is it a scaling factor of some sort that adjusts for the variance introduced by PC1 to sample 1's data?

I really hope this isn't a silly question, but as I've said I'm pretty new to this analysis and I feel like I'm missing something important here! Thanks again for this and other posts - they really have been incredibly helpful.
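
For reference, each value in pca$x is a per-sample score: the coordinate of that sample once its centred expression profile is projected onto the corresponding PC axis, which is why every sample has its own PC1 and PC2 value. A minimal check using the objects from above (assuming prcomp's default centring):

# the scores in pca$x are just the centred data projected onto the loadings
centred <- scale(t(randomdata), center = TRUE, scale = FALSE)
scores  <- centred %*% pca$rotation
all.equal(scores, pca$x)   # should be TRUE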


Hi, I have created a PCA using SNPRelate. Is there a way to calculate residuals from the PCA? I want to run a regression model of the form GWAS = lm(Y ~ PCA), where Y will be the residuals.
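
In case it helps, one common reading of that setup is to regress the phenotype on the top PCs and keep the residuals as the adjusted outcome; a minimal sketch, assuming pc is the object returned by SNPRelate::snpgdsPCA() and pheno is a numeric phenotype vector in the same sample order (both names are placeholders, not from the original post):

# placeholders: 'pc' from SNPRelate::snpgdsPCA(), 'pheno' = phenotype vector
pcs <- data.frame(PC1 = pc$eigenvect[, 1], PC2 = pc$eigenvect[, 2])
Y   <- resid(lm(pheno ~ PC1 + PC2, data = pcs))   # phenotype adjusted for the top PCs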

