Principal component analysis using a VCF file as input
3.8 years ago

I have virus sequences from different geographic regions, and I have performed population structure analysis on them with the STRUCTURE software, which gave an optimal K of 3. Now I want to run PCA on the same data with EIGENSOFT, but I only have VCF files and no case/control data. How can I use a VCF as input?

PCA VCF • 4.6k views
3.8 years ago
4galaxy77 2.9k

I would recommend first converting to PLINK format (I have seen a couple of odd things happen when a VCF is used directly).

plink2 --vcf data.vcf --make-bed --out data

If you haven't already, it's a good idea to LD-prune and remove rare variants:

plink2 --bfile data --maf 0.01 --indep-pairwise 50 5 0.2 --out data_clean
plink2 --bfile data --extract data_clean.prune.in --make-bed --out data_clean_prune

Then run the PCA:

plink2 --bfile data_clean_prune --pca --out data_clean_prune
3.8 years ago
tothepoint ▴ 940

You can make a PCA plot from a VCF file using the SNPRelate R package. There is already a relevant post, "VCF to PCA", that you can check.

3.8 years ago

I ran this command, but it only produced the .fam file:

Start time: Thu Jan 21 02:35:20 2021
3877 MiB RAM detected; reserving 1938 MiB for main workspace.
Using up to 4 compute threads.
--vcf: 4690 variants scanned.
--vcf: data-temporary.pgen + data-temporary.pvar + data-temporary.psam written.
822 samples (0 females, 0 males, 822 ambiguous; 822 founders) loaded from data-temporary.psam.
4690 variants loaded from data-temporary.pvar.
Note: No phenotype data present.
Writing data.fam ... done.
Writing data.bim ...
Error: data.bim cannot contain multiallelic variants.
End time: Thu Jan 21 02:35:20 2021


The important error here is "Error: data.bim cannot contain multiallelic variants".
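To see how many records trigger this error, one option is to scan the VCF for records whose ALT column lists more than one allele. The following is a minimal Python sketch, assuming a plain or gzip-compressed VCF; the function name `count_multiallelic` and the file name `data.vcf` are illustrative and not part of PLINK.

```python
import gzip


def count_multiallelic(vcf_path):
    """Return (total_records, multiallelic_records) for a VCF file."""
    # Assumption: files ending in .gz are gzip/bgzip compressed, others are plain text.
    opener = gzip.open if vcf_path.endswith(".gz") else open
    total = 0
    multi = 0
    with opener(vcf_path, "rt") as handle:
        for line in handle:
            if line.startswith("#") or not line.strip():
                continue  # skip header lines and blank lines
            fields = line.rstrip("\n").split("\t")
            total += 1
            # Column 5 (index 4) is ALT; several alternate alleles are comma-separated.
            if "," in fields[4]:
                multi += 1
    return total, multi


# Example usage (hypothetical file name):
# total, multi = count_multiallelic("data.vcf")
# print(f"{multi} of {total} records are multiallelic")
```

If only a handful of records are multiallelic, dropping or splitting them before conversion may be acceptable; otherwise the pgen format suggested in this thread avoids the problem.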


Use --make-pgen/--pfile instead of --make-bed/--bfile when working with multiallelic variants.


I used the --make-pgen command and it gave me three files: .pgen, .pvar, and .psam. How can I use these files to plot a PCA? Please guide me further.

--vcf: 4690 variants scanned.
--vcf: NV-temporary.pgen + NV-temporary.pvar.zst + NV-temporary.psam written.
822 samples (0 females, 0 males, 822 ambiguous; 822 founders) loaded from NV-temporary.psam.
4690 variants loaded from NV-temporary.pvar.zst.
Note: No phenotype data present.
Writing NV.psam ... done.
Writing NV.pvar ... done.
Writing NV.pgen ... done.
End time: Mon Jan 25 10:24:45 2021


I used this command for PCA (plink2 --pfile file --PCA), but it gave the error "failed to open .psam file".
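`--pfile` takes a file prefix and expects the matching `.pgen`, `.pvar`, and `.psam` files to exist, so one likely cause of this error is a prefix that does not match the names written by `--make-pgen` (the log above shows `NV.*`, not `file.*`). The thread does not say what the actual fix was, so treat this as a guess. A small Python sketch for checking a prefix follows; the helper name `missing_pfile_parts` is hypothetical.

```python
import os


def missing_pfile_parts(prefix):
    """List the PLINK 2 fileset members that are absent for a given prefix."""
    missing = []
    for ext in (".pgen", ".psam"):
        if not os.path.exists(prefix + ext):
            missing.append(prefix + ext)
    # The variant file may be plain (.pvar) or zstd-compressed (.pvar.zst).
    if not any(os.path.exists(prefix + ext) for ext in (".pvar", ".pvar.zst")):
        missing.append(prefix + ".pvar")
    return missing


# Example usage (prefixes taken from the posts above):
# print(missing_pfile_parts("file"))  # lists the missing files if the prefix is wrong
# print(missing_pfile_parts("NV"))    # an empty list means the fileset is complete
```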


Did you try googling the error message?


Thanks, that solved it. I now have two files, .eigenvec and .eigenval. I am confused because my .psam file has no sex (male/female) information. Will this bias the result?


Also, please guide me on how to use the .eigenvec and .eigenval files in R to plot the PCA.
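The two output files are plain text, so plotting mostly comes down to reading them into a table. The sketch below uses Python rather than R, but the same steps carry over to `read.table` and `plot` in R. It assumes the `.eigenvec` file starts with a header line such as `#IID PC1 PC2 ...` (optionally with an `FID` column) and that `.eigenval` holds one eigenvalue per line. All function names are illustrative, and the plotting helper assumes matplotlib is installed.

```python
def read_eigenvec(path):
    """Parse a .eigenvec file into (sample_ids, rows of PC coordinates)."""
    ids, pcs = [], []
    with open(path) as handle:
        # Assumption: first line is a header like "#FID IID PC1 PC2 ..." or "#IID PC1 ...".
        header = handle.readline().lstrip("#").split()
        iid_col = header.index("IID")
        first_pc = header.index("PC1")
        for line in handle:
            fields = line.split()
            if not fields:
                continue  # ignore blank lines
            ids.append(fields[iid_col])
            pcs.append([float(x) for x in fields[first_pc:]])
    return ids, pcs


def read_eigenval(path):
    """Read one eigenvalue per line from a .eigenval file."""
    with open(path) as handle:
        return [float(line) for line in handle if line.strip()]


def relative_share(eigenvalues):
    """Percentage of each eigenvalue relative to the sum of the reported ones."""
    total = sum(eigenvalues)
    return [100.0 * value / total for value in eigenvalues]


def plot_pc1_pc2(pcs, shares, out_png):
    """Scatter PC1 against PC2 and save the figure (requires matplotlib)."""
    import matplotlib
    matplotlib.use("Agg")  # render to a file without needing a display
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.scatter([row[0] for row in pcs], [row[1] for row in pcs], s=10)
    ax.set_xlabel(f"PC1 ({shares[0]:.1f}% of reported PCs)")
    ax.set_ylabel(f"PC2 ({shares[1]:.1f}% of reported PCs)")
    fig.savefig(out_png, dpi=150)
    plt.close(fig)


# Example usage (file names follow the --out prefix used earlier in the thread):
# ids, pcs = read_eigenvec("data_clean_prune.eigenvec")
# shares = relative_share(read_eigenval("data_clean_prune.eigenval"))
# plot_pc1_pc2(pcs, shares, "pca.png")
```

Note that the percentages are shares among the reported components only, not of the total variance in the data, because the `.eigenval` file lists only the top eigenvalues. To colour points by region or by STRUCTURE cluster, the sample IDs returned by `read_eigenvec` can be matched against a separate sample-to-group table.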


Please guide me. I have little knowledge of PLINK and PCA.

2.3 years ago
hewm2008 ▴ 50

I recently developed a new PCA analysis program, MingPCACluster, that goes from a VCF to the PCA and a plot (VCF2PCA and figure). It is very fast, uses little memory, and is accurate.

https://github.com/hewm2008/MingPCACluster

### run without pop.info
# ./bin/MingPCACluster -InVCF Khuman.vcf.gz -OutPut OUT
### run with pop.info
./bin/MingPCACluster -InVCF Khuman.vcf.gz -OutPut OUT -InSampleGroup pop.info