Hi All,
I need to do some permutation test to random sampling millions of SNPs set genotyping data from 1000 genome phase III dataset. each SNP-set only contains ~300 SNPs, however, I found the extract process is quite slow. for example, for the ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf, to extract these 300 SNPs require almost 20 mins.
plink --vcf ~/hpc/db/hg19/1000Genome/chr6.uni.vcf --extract chr6.rs6457620.txt --r inter-chr dprime --out rs6457620
Is there any fast way to extract the subset quickly?
real 17m52.822s user 8m39.925s sys 1m20.681s
Done! Thanks.
tabix ~/hpc/db/hg19/1000Genome/chr6.uni.vcf.gz -R chr6.rs6457620.txt > output.vcf
100 times faster!!!
Thanks.
Check @Kevin's post here:
Produce PCA bi-plot for 1000 Genomes Phase III in VCF format
Even tabix is slow, though. If you want quick access and manipulation of VCF-formatted data, then you should convert your data to BCF (Binary Call Format).
BCFtools view
would have been another option for you and could do the filtering in [likely] under 1 minute. I regularly filter / look up the entire 1000 Genomes Phase III data (1.3 gigabytes in BCF format).BCFtools is rapid because it utilises
htslib
I take this to make my first comparison between
tabix
and compressedbcf
. For this I've made abed
file out of 30000 randomly selected SNP from the 1000 Genomes file.Here's the result:
Nice :)
I'm not sure if this is the reason. I thought
tabix
utiliseshtslib
as well.fin swimmer
Interesting!
Great try. Thanks fin swimmer!
Great suggestion! Is there any suggestion on how to merge chromosome-splitted VCF to one big BCF?
If the samples are the same in all VCFs, you can do this with cat/grep/etc.
There is a pre-merged 1000 Genomes phase 3 dataset in plink2 format, with annotations included, at https://www.cog-genomics.org/plink/2.0/resources#1kg_phase3 (follow the "Quick start" instructions). You can export a merged VCF from this. (You can also use it directly; almost everything other than position-based lookup of a tiny number of SNPs is much faster with plink2 than bcftools.)
Thanks Chris. I just find I need transfer between vcftools and plink again and again since Plink will remove the 'phase' status of the data. Is there any way to keep the phased genotypes in plink format?
Stick to plink2 --pfile/--make-pgen when you need to preserve phase.