1, create random data
randomdata <- matrix(rexp(800, rate=.1), ncol = 40)
colnames(randomdata) <- paste0("sample", 1:ncol(randomdata))
rownames(randomdata) <- paste0("gene", 1:nrow(randomdata))
randomdata[1:5,1:5]
sample1 sample2 sample3 sample4 sample5
gene1 0.3603136 1.159749 3.0826237 3.309323 0.5730092
gene2 35.0652692 20.814076 5.7487437 3.502387 1.2635026
gene3 8.2158417 6.442191 2.1134863 26.724393 9.2087334
gene4 19.6870141 6.617958 5.7245038 10.698933 9.1174246
gene5 6.8046549 5.262233 0.2679569 25.170229 8.3293813
2, create fake metadata for DiseaseStatus and BMI
metadata <- data.frame(DiseaseStatus = c(rep("case", 20), rep("control", 20)), BMI = sample(10:40, 40, replace = TRUE))
head(metadata)
DiseaseStatus BMI
1 case 12
2 case 12
3 case 18
4 case 20
5 case 27
6 case 32
3, perform PCA and plot bi-plot for PC1 and PC2
pca <- prcomp(t(randomdata))
plot(pca$x[,1:2])
Also see my answer here for more 'PCA' plots using base R functions: A: PCA plot from read count matrix from RNA-Seq
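As an optional embellishment (my own sketch; the colours and labels are arbitrary choices), the bi-plot can be coloured by DiseaseStatus, with the variance explained by each component added to the axis labels:

# proportion of variance explained per component
pct <- round(100 * pca$sdev^2 / sum(pca$sdev^2), 1)
plot(pca$x[,1:2],
  col = ifelse(metadata$DiseaseStatus == 'case', 'red2', 'royalblue'),
  pch = 19,
  xlab = paste0('PC1 (', pct[1], '%)'),
  ylab = paste0('PC2 (', pct[2], '%)'))
legend('topright', legend = c('case', 'control'),
  col = c('red2', 'royalblue'), pch = 19)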
4, now prepare an object for regression modeling
modeling <- data.frame(metadata, pca$x[,1:2], t(randomdata))
modeling[1:5,1:5]
DiseaseStatus BMI PC1 PC2 gene1
sample1 case 12 -18.485678 8.3374667 0.3603136
sample2 case 12 -3.393260 -0.9962191 1.1597494
sample3 case 18 6.579711 -10.1392232 3.0826237
sample4 case 20 13.311610 26.2884376 3.3093231
sample5 case 27 16.998644 -3.0140936 0.5730092
# set control as reference level
modeling$DiseaseStatus <- factor(modeling$DiseaseStatus, levels = c('control','case'))
5, binary logistic regression - adjust for BMI, PC1, and PC2
summary(glm(DiseaseStatus ~ gene1 + BMI + PC1 + PC2, data = modeling, family = binomial(link = 'logit')))
Call:
glm(formula = DiseaseStatus ~ gene1 + BMI + PC1 + PC2, family = binomial(link = "logit"),
data = modeling)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.43692 -1.12846 0.01018 1.15101 1.33862
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.266677 1.030321 -0.259 0.796
gene1 -0.009663 0.041650 -0.232 0.817
BMI 0.013031 0.036195 0.360 0.719
PC1 -0.004580 0.018005 -0.254 0.799
PC2 -0.010359 0.020470 -0.506 0.613
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 55.452 on 39 degrees of freedom
Residual deviance: 54.908 on 35 degrees of freedom
AIC: 64.908
Number of Fisher Scoring iterations: 4
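As an aside (not part of the output above), the logistic regression coefficients are on the log-odds scale; if you want odds ratios, something along these lines works (Wald 95% intervals via confint.default):

fit <- glm(DiseaseStatus ~ gene1 + BMI + PC1 + PC2, data = modeling, family = binomial(link = 'logit'))
exp(cbind(OR = coef(fit), confint.default(fit)))  # odds ratios with 95% Wald CIs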
6, linear regression - adjust for BMI, PC1, and PC2
summary(lm(gene1 ~ DiseaseStatus + BMI + PC1 + PC2, data = modeling))
Call:
lm(formula = gene1 ~ DiseaseStatus + BMI + PC1 + PC2, data = modeling)
Residuals:
Min 1Q Median 3Q Max
-8.447 -5.121 -2.338 3.445 31.670
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 8.25736 4.11994 2.004 0.0528 .
DiseaseStatuscase -0.56933 2.60968 -0.218 0.8286
BMI -0.04532 0.14733 -0.308 0.7602
PC1 0.13024 0.06949 1.874 0.0693 .
PC2 -0.11322 0.08107 -1.397 0.1713
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 8.202 on 35 degrees of freedom
Multiple R-squared: 0.1373, Adjusted R-squared: 0.03869
F-statistic: 1.392 on 4 and 35 DF, p-value: 0.2567
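If you wanted to repeat this for every gene rather than just gene1, a simple loop along these lines would work (a rough sketch; 'res' is just a name I chose):

genes <- rownames(randomdata)
res <- t(sapply(genes, function(g) {
  # fit the same adjusted model, swapping in each gene as the outcome
  fit <- lm(as.formula(paste(g, '~ DiseaseStatus + BMI + PC1 + PC2')), data = modeling)
  summary(fit)$coefficients['DiseaseStatuscase', ]
}))
head(res)  # estimate, std. error, t value, and p-value per gene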
-------------------------------------
You should have good justification for wanting to adjust for PCs. Usually, this is done in genetic association studies to adjust for population stratification.
Don't just adjust for PCs because your supervisor told you to do it - ask them why. One should not add covariates to a regression model without justification.
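If you do include PCs, at least sanity-check them first (a quick sketch using the objects from above):

# how much variance does each PC explain?
summary(pca)$importance[2, 1:5]
# does a PC track a known variable such as BMI?
summary(lm(PC1 ~ BMI, data = modeling))$coefficients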
Kevin
Will do. As always, thank you for your detailed and helpful response!
You are very welcome, RNAseqer
Hello Kevin,
It's been many years since I did anything with principal components, and over the past few months I've done a lot of brushing up on the subject, but I'm having a bit of trouble understanding what some of their values are in the above covariate analysis.
As I understand it, PCA works like this: given a bunch of data-points of different values, there will be variance within the data. If you were to plot your data-points in 2 dimensions, you can see the variance in the X dimension and the variance in the Y dimension. Plotting in 3 dimensions, you can see variance along the x, y, and z axes, and in fact there can be variance within the data along many more dimensions than the three we can actually see. But, stated crudely, principal component analysis basically works like this: a principal component is an axis or line that can be drawn through the cloud of data-points in any one of those dimensions. If we imagine two-dimensional data, the line might be a horizontal one, or one that bisects the cloud of data-points at a 45 degree angle, or a 10 degree angle, or whatever. But there will be ONE line or axis that, with the data-points projected onto it, captures the maximum amount of variance in the data. If the two-dimensional cloud of data-points is shaped like a hot dog bun, then the axis that runs along the middle of the bun lengthwise, lying inside it like a hot dog, will be the axis that explains the most variation and thus be the PRINCIPAL component. An axis drawn orthogonal to this one, along the bun shape's width, will explain the second-largest amount of variance and be the second principal component. Since it is 2-dimensional data, there will only be 2 principal components.
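To check that my mental model matches what R does, I put together a little 2-D toy example for myself (entirely my own sketch):

set.seed(1)
x <- rnorm(200, sd = 4)            # long axis of the 'hot dog bun'
y <- 0.5 * x + rnorm(200, sd = 1)  # correlated, with a narrower spread
toy <- prcomp(cbind(x, y))
toy$rotation                        # the two orthogonal axes through the cloud
summary(toy)$importance[2, ]        # PC1 captures most of the variance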
This all largely makes sense to me, as does the math we use to determine this. What I'm struggling to understand is the value in the chart below for PC1, sample 1:
modeling <- data.frame(metadata, pca$x[,1:2], t(randomdata))
-18.485678, for example. What is that? It was my understanding that when you take your expression data and do PCA, you're getting the axes or lines that explain the largest amount of variance for ALL the data. So why aren't all the principal component values the same for all samples in this chart? After all, all the data from all samples was contained within that cloud that we divided up when finding our principal components, wasn't it? So why isn't the value representing PC1 the same for all samples? Is this some measure of sample 1's contribution to principal component 1? Is it the variance for all of sample 1's data-points (genes) projected onto the line of PC1? Is it a distinct principal component that comes from doing PCA on the sample 1 data-points alone? If so, how does THAT fit with the overall principal component determined for all data-points from all samples? Or is it a scaling factor of some sort that adjusts for the variance introduced by PC1 to sample 1's data?
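In case it helps pinpoint my confusion, I did notice that the values can be reproduced like this, though I'm not sure how to interpret it (my own check):

centered <- scale(t(randomdata), center = TRUE, scale = FALSE)  # centre each gene across samples
all.equal(unname(centered %*% pca$rotation), unname(pca$x))     # TRUE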
I really hope this isn't a silly question, but as I've said I'm pretty new to this analysis and I feel like I'm missing something important here! Thanks again for this and other posts, it really has been incredibly helpful.
Hi, I have created a PCA using SNPRelate. Is there a way to calculate residuals from the PCA? I want to run a regression model along the lines of GWAS = lm(Y ~ PCA), where Y will be the residuals.
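Roughly, what I am attempting is something along these lines (pheno and pcs are placeholder names for my own objects):

# regress the phenotype on the first few PCs, keeping the residuals as the new Y
fit <- lm(pheno ~ PC1 + PC2 + PC3, data = pcs)
Y <- resid(fit)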