Hello,
I use PLINK to get PCA with a input of SNPs. I am bit stuck with that.
my parameter is that
plink --vcf ${OUTPUT}/${INPUT}.vcf --remove low_ind.cov --allow-no-sex --pca var-wts --make-bed --double-id --allow-extra-chr --set-missing-var-ids @:# --out ${OUTPUT}/${OUT} --recode vcf
Then, I have .eigenvec, eigenval, and .eigenvec.var that you see the head of output below.
My first question is, how the eigenvec.var and eigenvec are related to each other ? I want to draw a PCA plot using those values together. Sample variance and SNP weights in one plot to see which positions drive the samples. Should I just merge them into one data frame or list ?
Also, the other point I'm confused about is about eigenvalue of variant eigenvector. There is no eigenvalue of it. Might the produced eigenvalue be related to sample eigenvector or to variant eigenvector as well ? Because I can also draw separate PCA plots for variant and sample eigenvector but should I use the same eigenvalue for variant eigenvector.
The last thing is that what is the nan value ? Does that mean those numbers did not affect the sample (individual) PCAs ?
Thank you in advance !
eigenvec.var
CM018572.1 CM018572.1:422695 G A nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan
CM018572.1 CM018572.1:422708 A T nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan
CM018572.1 CM018572.1:422736 T G-0.106167 0.374399 -0.382764 0.362484 -0.0506245 -0.0191916 -0.170022 -0.0713354 -0.173634 0.274229 -0.00362592 0.0792594 -0.175301 -0.550821 -1.161 -0.193439 1.00024 0.91805 0.56424 -4.53466
.eigenvec
S1 0.0261918 0.107223 -0.0933128 0.0587645 0.0263063 -0.0690121 -0.0324025 0.0554362 0.0741266 0.240196 -0.196983 0.0787568 0.242772 -0.772621 0.330559 -0.0736442 -0.0550954 -0.170204 0.020138 0.0628689
S2 0.000305038 0.0175883 -0.023663 0.00508857 0.0133318 0.0203648 0.00830506 0.0174132 0.00301209 0.0220885 0.00839291 -0.0395119 -0.00403775 -0.0114217 -0.0975636 0.0132438 -0.00266438 0.0888529 -0.0070219 -0.0344497
S3 -0.120546 -0.035378 0.266739 -0.400103 0.137936 0.133691 -0.253201 0.325438 -0.551546 -0.00728229 -0.177567 -0.0885545 0.377628 0.0647709 -0.113949 -0.0337703 0.00749686 -0.0158408 0.00423475 0.037192
eigenval
1.61066
1.43472
1.39878
So basically I want to draw a plot like this:
Regarding the nan values, what happens if you use plink 2.0 instead of 1.9 to perform the PCA? If you still see nans, can you post the full .log file from your run?
Hi chrchang523,
Plink 2.0 did not produce any nans, however, interestingly when I use the same paramater or adapt ones, plink2 produced some error but plink 1.9 did not. Here is parameters and the errors from the log files.
plink2 --vcf ${OUTPUT}/${INPUT}.vcf --allow-no-sex --make-bed --double-id --allow-extra-chr --set-missing-var-ids @:# -indep-pairwise 50 20 0.1 --out ${OUTPUT}/${OUT}
That error does not come with plink 1.9 even though it is the same parameter.
Also, when I don't perform prunning on the data, I have another error with plink2.
plink2 --vcf ${OUTPUT}/${INPUT}.vcf --allow-no-sex --make-bed --double-id --allow-extra-chr --set-missing-var-ids @:# --pca biallelic-var-wts --out ${OUTPUT}/${OUT} --recode vcf
Now it is confusing. I did not calculate any allele frequencies when I used plink 1.9. Do I have to calculate allele freq. in plink2 ? Also, please correct me If I am wrong but
var-wts
is adapted tobiallelic-var-wts
in plink2, isn't it ?Unfortunately, plink2 is correctly telling you that the analysis you're performing requires more samples that you have. plink 1.9 may give you results, but with so few samples, those results cannot be trusted.
Actually I have 42 samples, but 50 samples limitation should be a bit strict. I did not know that. So, is that error related to performing PCA? I guess I can still convert vcf into plink type file and performing prunning, can't I ?
I have another question not directly related to plink. I have a data which is very low coverage and has a lot of missing, eg more than 80% missingness.. How many SNPs should be enough for downstream analysis, and getting reliable PCA, admixture and so on ?
If you have any sort of reference dataset for this species, you should merge your 42 samples with that reference dataset before these steps. If you don't, sorry, I can't endorse trying to perform this analysis at all before collecting more samples. (50 is still a very small number; the point is that running these commands with less than 50 samples is practically ALWAYS a mistake.)
For human genome-wide analysis, you generally want at least 100000 variants (after LD pruning), with <=10% missingness within those variants. For other species, it may be okay to start with fewer variants, it depends on how much variation there is within the species.
Thank you so much ! These answers will be definitely useful for me. Probably, I will prepare a reference set from published reference genomes..
Just one more quick question, but what about for > 10% missing data, would the imputation be an alternative? Like, If there is no possiblity to get a high quality data in terms of wet-lab. Well, I am working on ancient DNA, moreoever environmental ancient DNA which is a bit more complex. That's why, I am also thinking about the imputation If it would make sense.