Hi,
I'm using PLINK (version 1.9) and encountered a memory error while using the --pca
command, so I decided to use SNPRelate in RStudio instead.
To test it out, I compared the eigenvectors and eigenvalues I got from a PLINK PCA of a different dataset to the results of the SNPRelate PCA. The values are pretty similar, but they just seem to be of different signs? I'm not sure why this occurred and how it can be resolved, and which eigenvectors are the correct ones.
Note: file names and sample ids changed.
PLINK
PLINK v1.90b6.15 64-bit (21 Jan 2020)
Options in effect:
--bfile data_pruned
--out data_PC
--pca 10
Random number seed: 1583390967
8192 MB RAM detected; reserving 4096 MB for main workspace.
352395 variants loaded from .bim file.
491 people (0 males, 491 females) loaded from .fam.
491 phenotype values loaded from .fam.
Using up to 4 threads (change this with --threads).
Before main variant filters, 491 founders and 0 nonfounders present.
Calculating allele frequencies... done.
Total genotyping rate is 0.998232.
352395 variants and 491 people pass filters and QC.
Among remaining phenotypes, 204 are cases and 287 are controls.
Excluding 7715 variants on non-autosomes from relationship matrix calc.
Relationship matrix calculation complete.
--pca: Results saved to data_PC.eigenval and data_PC.eigenvec .
data_PC.eigenvec
sample_1 -3.28359e-05 0.0472202 -0.033297
sample_2 0.0185079 -0.123125 -0.048018
sample_3 0.00786112 -0.066443 -0.0107746
sample_4 -0.00904159 -0.0490187 0.0346168
sample_5 0.00430433 0.0795848 -0.0236887
SNPRelate
#test using data
bed1 <- "data_pruned.bed"
fam1 <- "data_pruned.fam"
bim1 <- "data_pruned.bim"
snpgdsBED2GDS(bed1,fam1,bim1,"data.gds")
genofile <- snpgdsOpen("data.gds")
pca1 <- snpgdsPCA(genofile)
df <- data.framesample.id=pca1$sample.id,
EV1 = pca1$eigenvect[,1],
EV2 = pca1$eigenvect[,2],
EV3 = pca1$eigenvect[,3],
stringsAsFactors= FALSE)
.
> snpgdsPCA(genofile)
Principal Component Analysis (PCA) on genotypes:
Excluding 7,749 SNPs on non-autosomes
Excluding 0 SNP (monomorphic: TRUE, MAF: NaN, missing rate: NaN)
Working space: 491 samples, 344,646 SNPs
using 1 (CPU) core
PCA: the sum of all selected genotypes (0,1,2) = 87296725
CPU capabilities: Double-Precision SSE2
Sat Apr 4 23:52:15 2020 (internal increment: 724)
[==================================================] 100%, completed, 19s
Sat Apr 4 23:52:34 2020 Begin (eigenvalues and eigenvectors)
Sat Apr 4 23:52:34 2020 Done.
.
head(df)
sample.id EV1 EV2 EV3
1 sample_1 2.202874e-05 -0.04652531 0.03281472
2 sample_2 -1.801974e-02 0.11138703 0.04055612
3 sample_3 -7.873690e-03 0.06720024 0.01055124
4 sample_4 9.048831e-03 0.04950324 -0.03596400
5 sample_5 -4.337900e-03 -0.08086678 0.02444902
Please don't delete posts once they have received a comment/answer.