Question

PLINK - individual variance with the SNPs weights

0

Entering edit mode

2.2 years ago

young_bioinformatician ▴ 240

Hello,

I use PLINK to get PCA with a input of SNPs. I am bit stuck with that.

my parameter is that

plink --vcf ${OUTPUT}/${INPUT}.vcf --remove low_ind.cov --allow-no-sex --pca var-wts --make-bed --double-id --allow-extra-chr --set-missing-var-ids @:# --out ${OUTPUT}/${OUT} --recode vcf

Then, I have .eigenvec, eigenval, and .eigenvec.var that you see the head of output below.

My first question is, how the eigenvec.var and eigenvec are related to each other ? I want to draw a PCA plot using those values together. Sample variance and SNP weights in one plot to see which positions drive the samples. Should I just merge them into one data frame or list ?

Also, the other point I'm confused about is about eigenvalue of variant eigenvector. There is no eigenvalue of it. Might the produced eigenvalue be related to sample eigenvector or to variant eigenvector as well ? Because I can also draw separate PCA plots for variant and sample eigenvector but should I use the same eigenvalue for variant eigenvector.

The last thing is that what is the nan value ? Does that mean those numbers did not affect the sample (individual) PCAs ?

Thank you in advance !

eigenvec.var

CM018572.1 CM018572.1:422695 G A nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan

CM018572.1 CM018572.1:422708 A T nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan nan

CM018572.1 CM018572.1:422736 T G-0.106167 0.374399 -0.382764 0.362484 -0.0506245 -0.0191916 -0.170022 -0.0713354 -0.173634 0.274229 -0.00362592 0.0792594 -0.175301 -0.550821 -1.161 -0.193439 1.00024 0.91805 0.56424 -4.53466

.eigenvec

S1 0.0261918 0.107223 -0.0933128 0.0587645 0.0263063 -0.0690121 -0.0324025 0.0554362 0.0741266 0.240196 -0.196983 0.0787568 0.242772 -0.772621 0.330559 -0.0736442 -0.0550954 -0.170204 0.020138 0.0628689

S2 0.000305038 0.0175883 -0.023663 0.00508857 0.0133318 0.0203648 0.00830506 0.0174132 0.00301209 0.0220885 0.00839291 -0.0395119 -0.00403775 -0.0114217 -0.0975636 0.0132438 -0.00266438 0.0888529 -0.0070219 -0.0344497

S3 -0.120546 -0.035378 0.266739 -0.400103 0.137936 0.133691 -0.253201 0.325438 -0.551546 -0.00728229 -0.177567 -0.0885545 0.377628 0.0647709 -0.113949 -0.0337703 0.00749686 -0.0158408 0.00423475 0.037192

eigenval

1.61066

1.43472

1.39878

So basically I want to draw a plot like this:

PCA example

plink snp variant population pca • 2.3k views

ADD COMMENT • link 2.2 years ago by young_bioinformatician ▴ 240

0

Entering edit mode

Regarding the nan values, what happens if you use plink 2.0 instead of 1.9 to perform the PCA? If you still see nans, can you post the full .log file from your run?

ADD REPLY • link 2.2 years ago by chrchang523 11k

0

Entering edit mode

Hi chrchang523,

Plink 2.0 did not produce any nans, however, interestingly when I use the same paramater or adapt ones, plink2 produced some error but plink 1.9 did not. Here is parameters and the errors from the log files.

plink2 --vcf ${OUTPUT}/${INPUT}.vcf --allow-no-sex --make-bed --double-id --allow-extra-chr --set-missing-var-ids @:# -indep-pairwise 50 20 0.1 --out ${OUTPUT}/${OUT}

".... Error: This run estimates linkage disequilibrium between variants, but here are less than 50 samples to estimate from. You should perform this operation on a larger dataset...."

That error does not come with plink 1.9 even though it is the same parameter.

Also, when I don't perform prunning on the data, I have another error with plink2.

plink2 --vcf ${OUTPUT}/${INPUT}.vcf --allow-no-sex --make-bed --double-id --allow-extra-chr --set-missing-var-ids @:# --pca biallelic-var-wts --out ${OUTPUT}/${OUT} --recode vcf

"... Error: This run requires decent allele frequencies, but they aren't being loaded with --read-freq, and less than 50 samples are available to impute them from. You should generate (with --freq) or obtain an allele frequency file based on a large similar-population reference dataset, and load it with --read-freq..."

Now it is confusing. I did not calculate any allele frequencies when I used plink 1.9. Do I have to calculate allele freq. in plink2 ? Also, please correct me If I am wrong but var-wts is adapted to biallelic-var-wts in plink2, isn't it ?

ADD REPLY • link 2.2 years ago by young_bioinformatician ▴ 240

1

Entering edit mode

Unfortunately, plink2 is correctly telling you that the analysis you're performing requires more samples that you have. plink 1.9 may give you results, but with so few samples, those results cannot be trusted.

ADD REPLY • link 2.2 years ago by chrchang523 11k

0

Entering edit mode

Actually I have 42 samples, but 50 samples limitation should be a bit strict. I did not know that. So, is that error related to performing PCA? I guess I can still convert vcf into plink type file and performing prunning, can't I ?

I have another question not directly related to plink. I have a data which is very low coverage and has a lot of missing, eg more than 80% missingness.. How many SNPs should be enough for downstream analysis, and getting reliable PCA, admixture and so on ?

ADD REPLY • link 2.2 years ago by young_bioinformatician ▴ 240

2

Entering edit mode

If you have any sort of reference dataset for this species, you should merge your 42 samples with that reference dataset before these steps. If you don't, sorry, I can't endorse trying to perform this analysis at all before collecting more samples. (50 is still a very small number; the point is that running these commands with less than 50 samples is practically ALWAYS a mistake.)
For human genome-wide analysis, you generally want at least 100000 variants (after LD pruning), with <=10% missingness within those variants. For other species, it may be okay to start with fewer variants, it depends on how much variation there is within the species.

ADD REPLY • link 2.2 years ago by chrchang523 11k

0

Entering edit mode

Thank you so much ! These answers will be definitely useful for me. Probably, I will prepare a reference set from published reference genomes..

Just one more quick question, but what about for > 10% missing data, would the imputation be an alternative? Like, If there is no possiblity to get a high quality data in terms of wet-lab. Well, I am working on ancient DNA, moreoever environmental ancient DNA which is a bit more complex. That's why, I am also thinking about the imputation If it would make sense.

ADD REPLY • link 2.2 years ago by young_bioinformatician ▴ 240