What is a faster way than using bcftool's view to extract a population from 1000 Genome VCF files?
1
0
Entering edit mode
6.2 years ago
Volka ▴ 180

Hi all, I am currently trying to extract the East Asian population from the 1000 Genome Project VCF files. After trying out VCFtools and its --keep option and thinking it was taking too long (about an hour per chromosome), I moved on to using bcftools view with the following code:

for i in {1..22}
do
    bcftools view -S eas.txt -O z -o ALL.chr"$i".phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.EAS.vcf.gz $thgenome/1KG_phase3_vcf/ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chr"$i".phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz
done

Currently, it is processing about 5 chromosome VCF files each hour, but I am wondering if there is a faster program for this? Or is my code written inefficiently?

Thanks!

bcftools 1000 vcf subset extract • 3.4k views
ADD COMMENT
1
Entering edit mode
6.2 years ago

Using the files from https://www.cog-genomics.org/plink/2.0/resources#1kg_phase3 :

plink2 --pfile ... --keep-if SuperPop == EAS --export vcf

should be a lot faster.

ADD COMMENT
0
Entering edit mode

Hi, thanks for the reply! I was able to download the .pgen, .pvar and predigree corrected .psam files, and extracted them if there were zipped in a .zst format. Then I made sure the files were the same names and ran the following code:

plink2 --pfile all_phase3 --keep-if SuperPop==EAS --export vcf

However, I'm getting an "Error: Malformed .pgen file.". The output is below:

PLINK v2.00a1LM 64-bit Intel (11 Feb 2018)
www.cog-genomics.org/plink/2.0/ (C) 2005-2018 Shaun Purcell, Christopher Chang GNU General Public License v3 Logging to plink2.log. Options in effect: --export vcf --keep-if SuperPop==EAS --pfile all_phase3

Start time: Wed Sep 12 14:39:24 2018 64298 MB RAM detected; reserving 32149 MB for main workspace. Using up to 48 threads (change this with --threads). 2504 samples (1271 females, 1233 males; 2497 founders) loaded from all_phase3.psam. 84805772 variants loaded from all_phase3.pvar. 2 categorical phenotypes loaded. --keep-if: 2000 samples removed. 504 samples (260 females, 244 males; 504 founders) remaining after main filters. --export vcf to plink2.vcf ... 0% Error: Malformed .pgen file. End time: Wed Sep 12 14:40:52 2018

Do you have any suggestions on what's wrong?

ADD REPLY
1
Entering edit mode

This requires a more recent plink2 build. Alpha 1 did not support multiallelic variants.

ADD REPLY
0
Entering edit mode

Oh, that makes sense. I downloaded the latest development build and it works perfectly now, thank you very much! It was very quick too. Is there anything else I would have note about how plink2 handles VCF files, other than what's noted about the reference alleles and phase information in the What's New page?

ADD REPLY
0
Entering edit mode

Some situations where you should still use other tools:

  • There are extra FORMAT fields like GQ and DP, and you still need to retain them (plink2 can filter on them, but it won’t save the original values).

  • There are {P(AA), P(AB), P(BB)} genotype likelihood triplets, and you need to retain all of these values instead of collapsing them into a single dosage.

  • There’s unusual ploidy to keep track of (e.g. trisomy 21).

ADD REPLY

Login before adding your answer.

Traffic: 1578 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6