8.3 years ago
cmdcolin
I was trying to filter VCF files by sample using vcftools, and I'm testing on the 1000 Genomes datasets.
If I try to filter by CEU samples, for example, I can try this:
vcftools --gzvcf ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz --recode --out CEU --keep CEU.tsv
where CEU.tsv contains the sample IDs from the CEU population.
The thing is that the output appears to include variants where there is no variation among the kept samples. I also tried setting --min-alleles, but this didn't seem to fix it.
This operation is also pretty slow... are there any faster ways to do it?
Thanks again for this answer. I'm finding my own questions in a Google search now, 2 years later. Note that a relatively recent version of bcftools should be used, e.g. the one from htslib, simply because options like --min-ac don't exist in the old 0.1.19 from the samtools package. If someone just wants a single sample, you can just use:
bcftools view -s HG00096 --min-ac 1 100genomes.vcf.gz
where --min-ac 1 makes sure that there is at least 1 non-reference allele for that sample at every site in the resulting output.
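For the original population case, the same idea should carry over by passing the sample list with -S instead of a single ID with -s. This is only a rough sketch, not something tested on the full chr1 file: it assumes CEU.tsv holds one sample ID per line, and the output name CEU.vcf.gz is made up for illustration. The script only runs bcftools if it is installed and both input files are present.

```shell
#!/bin/sh
# Sketch: keep only the samples listed in a file and drop sites where
# none of the kept samples carries a non-reference allele.

# Input VCF from the question; SAMPLES and OUT are assumed names.
VCF="ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz"
SAMPLES="CEU.tsv"      # assumed format: one sample ID per line
OUT="CEU.vcf.gz"       # hypothetical output file name

if command -v bcftools >/dev/null 2>&1 && [ -f "$VCF" ] && [ -f "$SAMPLES" ]; then
    # -S        file of sample IDs to keep
    # --min-ac  require at least 1 non-reference allele among kept samples
    # -Oz -o    write bgzip-compressed VCF to $OUT
    bcftools view -S "$SAMPLES" --min-ac 1 -Oz -o "$OUT" "$VCF"
else
    echo "bcftools or input files not found; nothing done" >&2
fi
```

Writing compressed output with -Oz keeps the result usable by other htslib-based tools without a separate compression step.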