Question

Split 1000 Genomes VCF by subpopulation

1

4.4 years ago

ThePlaintiff ▴ 90

How do I split 1000 Genomes VCF files by sub-population while retaining only the variants that are present in that sub-population? For example, if I have the 1000 Genomes chromosome 10 file as chr10.vcf, I'd like to get from it: chr10_LWK.vcf (LWK sub-population), chr10_YRI.vcf (YRI sub-population), etc. I would then like to find SNPs that are present in LWK but absent in YRI using bcftools isec or contrast.

Thanks

SNP genome next-gen • 1.1k views

updated 4.4 years ago by Kevin Blighe 88k • written 4.4 years ago by ThePlaintiff ▴ 90

score 0 · Answer 1 · 2020-06-17

0

4.4 years ago

Kevin Blighe 88k

In step 2 of this tutorial, Produce PCA bi-plot for 1000 Genomes Phase III - Version 2, you can obtain a PED file that contains the IID-to-population mappings (IID = individual ID). You can then create a list of sample IDs for each population and use it to subset the VCF with BCFtools. For comparing the variants afterwards, my preference would be indexed AWK arrays, but, of course, feel free to use whatever you feel is appropriate.
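As a rough illustration of the two steps above (building per-population sample lists, then comparing variants), here is a minimal Python sketch. It is not the tutorial's code. The column names "Individual ID" and "Population" are assumptions about the PED header, the toy records are made up, and the set difference on (CHROM, POS, REF, ALT) keys merely plays the role that indexed AWK arrays would.

```python
import csv
import io
from collections import defaultdict


def samples_by_population(ped_handle, id_col="Individual ID", pop_col="Population"):
    """Group sample IDs by population from a tab-delimited file with a header row.

    The default column names are assumptions; adjust them to the actual PED header.
    """
    groups = defaultdict(list)
    for row in csv.DictReader(ped_handle, delimiter="\t"):
        groups[row[pop_col]].append(row[id_col])
    return dict(groups)


def variant_keys(vcf_lines):
    """Return the set of (CHROM, POS, REF, ALT) keys, skipping header and blank lines."""
    keys = set()
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue
        chrom, pos, _id, ref, alt = line.rstrip("\n").split("\t")[:5]
        keys.add((chrom, pos, ref, alt))
    return keys


# Toy stand-in for the PED file (hypothetical samples S1..S3).
ped_text = (
    "Family ID\tIndividual ID\tPopulation\n"
    "F1\tS1\tLWK\n"
    "F2\tS2\tYRI\n"
    "F3\tS3\tLWK\n"
)
groups = samples_by_population(io.StringIO(ped_text))

# Toy stand-ins for the per-population VCFs (hypothetical records).
header = ["##fileformat=VCFv4.2", "#CHROM\tPOS\tID\tREF\tALT"]
lwk_vcf = header + ["10\t100\trs1\tA\tG", "10\t200\trs2\tC\tT"]
yri_vcf = header + ["10\t200\trs2\tC\tT"]

# Variants present in LWK but absent in YRI.
lwk_only = variant_keys(lwk_vcf) - variant_keys(yri_vcf)
print(groups)
print(sorted(lwk_only))
```

Each list in `groups` can be written to a text file, one ID per line, and passed to `bcftools view --samples-file`. Adding `--min-ac 1` should then drop sites with no alternate allele left in the subset, but check that against your BCFtools version.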

Kevin

1

Thank you Kevin. Your tutorial provided many insights.

4.4 years ago by ThePlaintiff ▴ 90