Count numbers of population-level and sample-level SNPs in my VCF
9.4 years ago
Cece ▴ 30

I'm a Python newbie and I'm trying to extract data on population-level and sample-level SNPs for the AFR population from a VCF file. The data look like this:

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  HG00096 HG00097 HG00099 HG00100 HG00101 HG00102 HG00103 HG00105 HG00106 HG00107 HG00108 HG00109 HG00110 HG00111 HG001
20      60343   .       G       A       100     PASS    AC=1;AF=0.000199681;AN=5008;NS=2504;DP=20377;EAS_AF=0;AMR_AF=0.0014;AFR_AF=0;EUR_AF=0;SAS_AF=0;AA=.|||  GT      0|0     0|0     0|0

I've managed to come up with the following:

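import gzip

# A .vcf.gz file is compressed, so open it with gzip in text mode
with gzip.open('filename.vcf.gz', 'rt') as vcf:
    for line in vcf:
        if line.startswith('##'):
            continue  # skip meta-information lines
        if line.startswith('#'):
            # header line: columns 0-8 are fixed fields, 9 onward are sample IDs
            data = line.rstrip().split('\t')
            samples = data[9:]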

which should clean up and isolate my data columns, but I'm lost as to how to achieve the rest.

python SNP • 2.8k views

Thanks, but I'm learning Python as well as bioinformatics, so I'm working on learning how to write my own scripts.


Good plan. Here is an API that could be helpful: https://github.com/jamescasbon/PyVCF
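If you would rather write it by hand first, here is a minimal sketch using only the standard library. It rests on two assumptions that are mine, not from the question: a "population-level" SNP is a site where `AFR_AF` in the INFO column is greater than 0, and a "sample-level" SNP is a site where a given sample carries at least one non-reference allele. The names `parse_info`, `count_snps`, and `count_snps_in_file` are made up for illustration, and the demo records are toy lines, not real data. Note that the VCF itself does not say which samples are African, so limiting the per-sample counts to AFR individuals would need a separate sample-to-population list.

```python
import gzip
from collections import Counter


def parse_info(info_field):
    """Turn 'AC=1;AF=0.0002;...' into a dict; flags without '=' map to ''."""
    info = {}
    for item in info_field.split(';'):
        key, _, value = item.partition('=')
        info[key] = value
    return info


def count_snps(lines, pop_key='AFR_AF'):
    """Return (population-level site count, per-sample Counter) from VCF lines."""
    samples = []
    pop_count = 0
    per_sample = Counter()
    for line in lines:
        if line.startswith('##'):
            continue  # meta-information
        fields = line.rstrip('\n').split('\t')
        if line.startswith('#'):
            samples = fields[9:]  # sample IDs start at column 9
            continue
        info = parse_info(fields[7])
        # multi-allelic sites list one frequency per ALT allele, comma-separated
        freqs = [float(x) for x in info.get(pop_key, '0').split(',')
                 if x not in ('', '.')]
        if any(f > 0 for f in freqs):
            pop_count += 1
        # assumes GT is present in the FORMAT column, as in the example data
        gt_index = fields[8].split(':').index('GT')
        for name, call in zip(samples, fields[9:]):
            alleles = call.split(':')[gt_index].replace('|', '/').split('/')
            if any(a not in ('0', '.') for a in alleles):
                per_sample[name] += 1
    return pop_count, per_sample


def count_snps_in_file(path, pop_key='AFR_AF'):
    """Same as count_snps, reading a gzip-compressed VCF from disk."""
    with gzip.open(path, 'rt') as vcf:
        return count_snps(vcf, pop_key)


# Toy records (made up for this demo) with two samples
demo = [
    '##fileformat=VCFv4.1\n',
    '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\n',
    '20\t100\t.\tG\tA\t100\tPASS\tAC=1;AFR_AF=0;EUR_AF=0\tGT\t0|0\t0|0\n',
    '20\t200\t.\tA\tG\t100\tPASS\tAC=1;AFR_AF=0.0015\tGT\t0|0\t1|0\n',
]
pop_count, per_sample = count_snps(demo)
print(pop_count, dict(per_sample))
```

On your own file you would call `count_snps_in_file('filename.vcf.gz')` instead of the demo list. Keeping the parsing in a function that accepts any iterable of lines makes it easy to test on a few hand-written records before running it on the full chromosome.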
