Count the number of population-level and sample-level SNPs in my VCF

10.0 years ago

Cece ▴ 30

I'm a Python newbie, and I'm trying to extract population-level and sample-level SNP data for the AFR population from a VCF file. The data look like this:

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  HG00096 HG00097 HG00099 HG00100 HG00101 HG00102 HG00103 HG00105 HG00106 HG00107 HG00108 HG00109 HG00110 HG00111 HG001
20      60343   .       G       A       100     PASS    AC=1;AF=0.000199681;AN=5008;NS=2504;DP=20377;EAS_AF=0;AMR_AF=0.0014;AFR_AF=0;EUR_AF=0;SAS_AF=0;AA=.|||  GT      0|0     0|0     0|0

I've managed to come up with the following:

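import gzip

# A .vcf.gz file is compressed, so open it with gzip in text mode ('rt')
with gzip.open('filename.vcf.gz', 'rt') as vcf:
    for line in vcf:
        if line[:2] == '##':
            continue  # skip the meta-information lines
        if line[0] == '#':
            data = line.rstrip().split('\t')
            ncol = len(data)
            for i in range(9, ncol):
                print(i, data[i])  # sample IDs start at column index 9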

which should clean up and isolate my data columns, but I'm lost as to how to achieve the rest.

python SNP • 3.0k views

Updated 2.7 years ago by Ram 45k • written 10.0 years ago by Cece ▴ 30

If you don't want to write your own: https://github.com/jewmanchue/vcflib/wiki/Basic-population-statistics-with-GPAT

Written 10.0 years ago by Zev.Kronenberg 12k

Thanks, but I'm learning Python as well as bioinformatics, so I'd rather learn how to write my own scripts.

Written 10.0 years ago by Cece ▴ 30

Good plan. Here is an API that could be helpful: https://github.com/jamescasbon/PyVCF

Updated 2.7 years ago by Ram 45k • written 10.0 years ago by Zev.Kronenberg 12k
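
For anyone who wants to stay in plain Python, here is a minimal sketch of one way to finish the loop from the question. It rests on assumptions that the thread does not confirm: a "population-level SNP" is taken to be a site whose AFR_AF value in the INFO column is greater than 0, and a "sample-level SNP" is taken to be a site where a given sample carries at least one non-reference allele. The VCF itself does not say which samples are AFR, so the `keep_samples` argument is a hypothetical set of sample IDs you would have to build from a separate panel file. The names `open_vcf`, `is_snp`, and `count_snps` are made up for this illustration.

```python
import gzip
from collections import Counter


def open_vcf(path):
    """Open a plain or gzip-compressed VCF in text mode."""
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path, "r")


def is_snp(ref, alt):
    """True if REF and every ALT allele are single bases."""
    return len(ref) == 1 and ref in "ACGTN" and all(
        len(a) == 1 and a in "ACGT" for a in alt.split(",")
    )


def count_snps(lines, keep_samples=None, pop_key="AFR_AF"):
    """Return (population_level_count, per_sample_counts).

    lines: any iterable of VCF lines, for example an open file.
    keep_samples: optional set of sample IDs to restrict to
        (assumption: built elsewhere, e.g. from a population panel).
    """
    samples = []
    pop_count = 0
    per_sample = Counter()
    for line in lines:
        if line.startswith("##"):
            continue  # meta-information lines
        fields = line.rstrip("\n").split("\t")
        if line.startswith("#"):
            samples = fields[9:]  # sample IDs follow the FORMAT column
            continue
        ref, alt, info = fields[3], fields[4], fields[7]
        if not is_snp(ref, alt):
            continue  # skip indels and symbolic alleles
        info_dict = dict(
            item.split("=", 1) for item in info.split(";") if "=" in item
        )
        # Multi-allelic sites list one frequency per ALT allele.
        freqs = [
            float(x)
            for x in info_dict.get(pop_key, "0").split(",")
            if x != "."
        ]
        if any(f > 0 for f in freqs):
            pop_count += 1
        for name, call in zip(samples, fields[9:]):
            if keep_samples is not None and name not in keep_samples:
                continue
            genotype = call.split(":")[0]  # GT is the first FORMAT entry
            alleles = genotype.replace("|", "/").split("/")
            if any(a not in ("0", ".") for a in alleles):
                per_sample[name] += 1
    return pop_count, per_sample
```

To run it on a real file, pass the handle returned by `open_vcf('filename.vcf.gz')` as `lines`. Because the function only needs an iterable of strings, it can also be tried on a handful of pasted lines before pointing it at a whole chromosome.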
