6.2 years ago
Volka
Hi all, I am currently trying to extract the East Asian population from the 1000 Genomes Project VCF files. I first tried VCFtools with its --keep option, but it was taking too long (about an hour per chromosome), so I moved on to bcftools view with the following code:
for i in {1..22}
do
    # Keep only the samples listed in eas.txt and write a bgzip-compressed VCF
    bcftools view -S eas.txt -O z \
        -o ALL.chr"$i".phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.EAS.vcf.gz \
        "$thgenome"/1KG_phase3_vcf/ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/ALL.chr"$i".phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz
done
Currently, it is processing about 5 chromosome VCF files each hour, but I am wondering if there is a faster program for this? Or is my code written inefficiently?
Thanks!
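One option that stays with bcftools is to run several chromosomes at once instead of one after another. The following is only a sketch under assumptions that are not part of the original post: the `JOBS`, `THREADS`, and `DRY_RUN` variables and the helper function names are made up for illustration, and the paths are assumed to contain no spaces. By default it only prints the commands it would run.

```shell
#!/bin/sh
# Sketch: build one bcftools command per autosome, then run them in parallel.
# JOBS, THREADS and DRY_RUN are illustrative names, not bcftools settings.

in_dir="${thgenome:-.}/1KG_phase3_vcf/ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502"
stem="phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes"

# Print the subsetting command for the chromosome given as $1.
# --threads adds worker threads for compressing the output.
subset_cmd() {
  printf 'bcftools view -S eas.txt --threads %s -O z -o ALL.chr%s.%s.EAS.vcf.gz %s/ALL.chr%s.%s.vcf.gz\n' \
    "${THREADS:-2}" "$1" "$stem" "$in_dir" "$1" "$stem"
}

# Print the commands for chromosomes 1 to 22, one per line.
all_cmds() {
  chr=1
  while [ "$chr" -le 22 ]; do
    subset_cmd "$chr"
    chr=$((chr + 1))
  done
}

if [ "${DRY_RUN:-1}" = 1 ]; then
  # Default: only show what would be run.
  all_cmds
else
  # Run up to JOBS commands at the same time (assumes no spaces in paths).
  all_cmds | xargs -P "${JOBS:-4}" -I{} sh -c '{}'
fi
```

Setting `DRY_RUN=0 JOBS=4` would start four chromosomes at a time. Whether this helps depends on disk throughput and the number of cores available, so it is worth timing on one or two chromosomes first.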
Hi, thanks for the reply! I was able to download the .pgen, .pvar and pedigree-corrected .psam files, and decompressed the ones that were in .zst format. Then I made sure the files all had the same base name and ran the following code:
However, I'm getting an "Error: Malformed .pgen file." The output is below:
Do you have any suggestions on what's wrong?
This requires a more recent plink2 build. Alpha 1 did not support multiallelic variants.
Oh, that makes sense. I downloaded the latest development build and it works perfectly now, thank you very much! It was very quick too. Is there anything else I should note about how plink2 handles VCF files, other than what's mentioned about reference alleles and phase information on the What's New page?
Some situations where you should still use other tools:
There are extra FORMAT fields like GQ and DP, and you still need to retain them (plink2 can filter on them, but it won’t save the original values).
There are {P(AA), P(AB), P(BB)} genotype likelihood triplets, and you need to retain all of these values instead of collapsing them into a single dosage.
There’s unusual ploidy to keep track of (e.g. trisomy 21).
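To check the first point for a given file, one quick approach is to list the FORMAT fields declared in the VCF header and see what is there besides GT. This is a rough sketch: the function names are made up for illustration, and it only reads header lines, so it says nothing about which of those fields a particular plink2 build can keep.

```shell
#!/bin/sh
# Sketch: list the FORMAT fields declared in a VCF header read from stdin.

# Print the ID of every ##FORMAT line.
format_ids() {
  sed -n 's/^##FORMAT=<ID=\([^,>]*\).*/\1/p'
}

# Print only the FORMAT IDs other than the plain genotype field GT.
extra_format_ids() {
  format_ids | grep -v -x 'GT'
}

# Small self-contained demo on a made-up header.
extra_format_ids <<'EOF'
##fileformat=VCFv4.2
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
EOF
# prints GQ and DP, one per line
```

On a real file the header can be piped in with `bcftools view -h file.vcf.gz | extra_format_ids`, where `file.vcf.gz` is a placeholder name. An empty result suggests the file carries genotypes only.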