Hello,
I am planning to run regression in plink via: --glm genotypic And I am wondering should I include all these covariates. So how would I find out which PCs are most strongly correlated with my covariates: td, array, HBA1C, age.... I should add that this is case-control study. The reasoning behind this is that if say PC4 turns out to be most strongly correlated with "pheno" case status (2/1) (not independant of it) then the adjustment for the PC4 will also remove some of the treatment effect.
and my covariate file looks like this:
FID IID pheno sex age PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10 td array HBA1C
1000025 1000025 -9 Female 58 -13.0692 3.82546 -2.80895 -0.120865 -9.5369 2.25826 0.975514 -2.22046 1.15324 3.67727 -9 Biobank 29.8
1000056 1000056 -9 Female 57 -13.0419 4.07604 -2.35979 2.32617 -0.73101 -1.56894 1.64281 -2.55201 -2.5772 0.111892 -9 Bileve NA
1000089 1000089 -9 Female 66 -9.557 4.27054 -0.91537 -0.160893 -2.27807 -1.08545 -0.551797 0.360637 0.30142 -1.28888 -9 Biobank 28.1
1000093 1000093 1 Female 65 -11.4407 0.716765 -0.0932982 -1.22921 2.56851 -0.0708945 2.841 1.39155 -5.13552 -1.97563 2 Biobank 65.3
1000115 1000115 -9 Male 51 -15.0062 6.97111 -2.93247 -3.03366 -3.73211 1.80292 0.71272 1.93733 -1.83419 -1.46217 -9 Biobank 30.7
...
I guess I can use something like this can get correlation coefficients:
library(psych)
pairs.panels(mydata)
Say that I have high correlation coefficient between PC4 and pheno, should I not include PC4 in my analysis?
Is it better to first say do "scree" plot in R and determine which PCs to keep and than include only those in my covariate file and do the above analysis via pairs.panels()?
Any thoughts would be highly appreciated!
HI Kevin,
Thank you so much for getting back to me.
So this is what I did:
I extracted just my covariates in data frame (without previously calculated PCs for my UKB data)
and it looks like this:
Then I calculated PCs only for those covariates and determined the proportion of variance explained via:
you can see that none of the covariates (sex age td array HBA1C) contributes largely to variance...Does that mean I keep all of them?
Also what about calculating Principal components for my genomic data ( I already got calculated PCs for that from UKB) how do I select there how many to keep? Is "scree" plot in R valuable estimation there?
And yes it is Diabetic Retinopathy :)
on another hand if I keep my whole covariate file with all 10 PCs and all 6 covariates inclusing "pheno" (makes it total to 16) and I calculate correlations between them, none of them is passing 0.7
So what would mean I keep all coavriates and all PCs?
Hi again, I am not sure that there is much justification for calculating PC covariates for these (sex, age, td, array, and HBA1C). For example, some of these should be categorical variables, but PCA will regard them as continuous variables. I thought that you had other types of variables in your dataset.
I would just literally determine which of these are potential confounders via independent models, and then only include those that are statistically significant: