Hi everyone,
I'm running ADMIXTURE to uncover population structure in Arabidopsis thaliana accessions from the 1001 Genomes Project. A quick summary of what I've done so far: first, I removed nearly identical accessions by calculating pairwise genome-wide identity-by-state differences with PLINK; when a pair differed by fewer than 0.01 changes per polymorphic site, I randomly removed one member of the pair. Then I kept only biallelic SNPs with a genotype call rate >95%, which resulted in a genotype matrix of ~4 million SNPs, like so:
bcftools view -i 'F_MISSING<0.05' -m2 -M2 -v snps myvcf.vcf.gz -Oz -o myfilteredoutput.vcf.gz
After converting it to PLINK .bed format, I'm now running ADMIXTURE with cross-validation to select the best K, like so:
admixture --cv myfilteredoutput.bed 2 > log2.out
I'm doing this for every K from 1 to 20, but it takes an immense amount of time (a day or more for each K).
What can I do to speed up the process?
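For reference, the scan over K looks roughly like the sketch below. It is a minimal sketch, not my exact script: the function names run_cv_scan and collect_cv_errors are just illustrative, and the grep step assumes ADMIXTURE writes lines starting with "CV error" to each log.

```shell
#!/usr/bin/env bash
# Sketch: run ADMIXTURE with cross-validation for a range of K values,
# writing one log per K. Function names here are illustrative only.
run_cv_scan() {
    local bed="$1"      # PLINK .bed file, e.g. myfilteredoutput.bed
    local k_min="$2"    # smallest K to try
    local k_max="$3"    # largest K to try
    local k
    for k in $(seq "$k_min" "$k_max"); do
        # Same command as the single-K run, with K substituted in
        admixture --cv "$bed" "$k" > "log${k}.out"
    done
}

# Collect the cross-validation error lines from all logs
# (assumes each log contains a line beginning with "CV error").
collect_cv_errors() {
    grep -h "^CV error" log*.out
}
```

Called as `run_cv_scan myfilteredoutput.bed 1 20` followed by `collect_cv_errors`, this runs the K values one after another, which is where all the time goes.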
Thanks