I am trying to do PCA for the whole genome, is it possible with snp relate package of R to get the principal components for the whole genome ? If not please suggest some feasible methods
I am trying to do PCA for the whole genome, is it possible with snp relate package of R to get the principal components for the whole genome ? If not please suggest some feasible methods
If your genotype is in vcf format, you can convert it to plink format then use the --pca function from PLINK.
If you have large number of samples (which shouldn't be the case if you are using 1000G), then you can consider flashPCA
Note: If you are directly converting the vcf to plink format, make sure you handled the multi-allelic SNPs and you might also need to manually edit the bim file such that there isn't any duplicated SNP id
I have created a tutorial to do this. Take a look here: Produce PCA for 1000 Genomes Phase III in VCF format
Thank you for sharing. I tried to run this part of the code : plink --noweb --bfile /YourDir/1000Genomes/chr$chr.1kg.phase3.v5 --list-duplicate-vars
I get the error as :plink: "unknown option --noweb" plink: "unknown option --bcf" plink: "unknown option --keep-allele-order" plink: "unknown option --vcf -idspace-to"
. I downloaded plink-1.07
Hi friend,
The find
command should list all *.bim files under /YourDir/
The grep
command then extracts only those BIM files in the /YourDir/1000Genomes/MAF/ directory
The sed
command just removes .bim from the ends of the files (required for when reading into plink)
You could also just write
find /YourDir/1000Genomes/MAF/ -name "*.bim" > /YourDir/ForMerge.list ;
sed -i 's/.bim//g' /YourDir/ForMerge.list ;
Does that work?
I don't use MAC, apologies, but I will direct you to this thread about this specific issue of using sed on MAC: http://techslides.com/grep-awk-and-sed-in-bash-on-osx
I think that it is the -i switch/parameter that causes the issue. You could also not use the -i switch/parameter and then redirect the output to a new file. This switch/parameter just instructs sed
to edit and save the file that you specify. Without it, it sends the modified version to stdout.
ForMerge.list should only contain bim files. Here is the version that I have on my laprop:
1000Genomes/MAF/chr11.1kg.phase3.v5.bim
1000Genomes/MAF/chr4.1kg.phase3.v5.bim
1000Genomes/MAF/chr22.1kg.phase3.v5.bim
1000Genomes/MAF/chr21.1kg.phase3.v5.bim
1000Genomes/MAF/chr2.1kg.phase3.v5.bim
1000Genomes/MAF/chr12.1kg.phase3.v5.bim
1000Genomes/MAF/chr13.1kg.phase3.v5.bim
1000Genomes/MAF/chrX.1kg.phase3.v5.bim
1000Genomes/MAF/chr6.1kg.phase3.v5.bim
1000Genomes/MAF/chr3.1kg.phase3.v5.bim
1000Genomes/MAF/chr8.1kg.phase3.v5.bim
1000Genomes/MAF/chr16.1kg.phase3.v5.bim
1000Genomes/MAF/chr14.1kg.phase3.v5.bim
1000Genomes/MAF/chr19.1kg.phase3.v5.bim
1000Genomes/MAF/chr10.1kg.phase3.v5.bim
1000Genomes/MAF/chr9.1kg.phase3.v5.bim
1000Genomes/MAF/chr5.1kg.phase3.v5.bim
1000Genomes/MAF/chr17.1kg.phase3.v5.bim
1000Genomes/MAF/chr18.1kg.phase3.v5.bim
1000Genomes/MAF/chr1.1kg.phase3.v5.bim
1000Genomes/MAF/chr15.1kg.phase3.v5.bim
1000Genomes/MAF/chr7.1kg.phase3.v5.bim
1000Genomes/MAF/chr20.1kg.phase3.v5.bim
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Thank you for reply :) I used the below code to convert the file to plink vcftools --gzvcf file.gz --out myfile --plink - tped. It gives me .tfam and .tped files. Will these two files be sufficient to proceed with PCA?
As mentioned by GabrielMontenegro, you will need to do the following:
--indep
or--indep-pairwise
--extract
--pca
(You can perform 4 and 5 together in one single command)
Thank you . It was helpful
Can u please let me know how to go with cleaning the multi allelic sites. I am new to Genomics
You can read here. Though I do suggest you to follow Kevin's tutorial
PLINK can handle VCF files. The problem as Sam mentioned is that you will first need to exclude multi-allelic sites. Also, before doing a PCA, you will have to remove SNPs on LD (i.e. prune your data). You can use the
--indep
orindep-pairwise
function in PLINK too.