9.4 years ago
Cece
I'm a Python newbie and I'm trying to extract population-level and sample-level SNP data for the AFR population from a VCF file. The data looks like this:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT HG00096 HG00097 HG00099 HG00100 HG00101 HG00102 HG00103 HG00105 HG00106 HG00107 HG00108 HG00109 HG00110 HG00111 HG001
20 60343 . G A 100 PASS AC=1;AF=0.000199681;AN=5008;NS=2504;DP=20377;EAS_AF=0;AMR_AF=0.0014;AFR_AF=0;EUR_AF=0;SAS_AF=0;AA=.||| GT 0|0 0|0 0|0
I've managed to come up with the following:
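import gzip

# The file is gzip-compressed, so open it with gzip in text mode ('rt')
with gzip.open('filename.vcf.gz', 'rt') as vcf:
    for line in vcf:
        if line.startswith('##'):
            continue  # skip meta-information lines
        if line.startswith('#'):
            # header line: columns 0-8 are fixed, sample IDs start at column 9
            data = line.rstrip().split('\t')
            ncol = len(data)
            for i in range(9, ncol):
                sample_id = data[i]  # placeholder: per-sample processing goes here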
which should clean up and isolate my data columns, but I'm lost as to how to achieve the rest.
If you don't want to write your own: https://github.com/jewmanchue/vcflib/wiki/Basic-population-statistics-with-GPAT
Thanks, but I'm learning Python as well as bioinformatics, so I'm working on learning how to write my own scripts.
Good plan. Here is an API that could be helpful: https://github.com/jamescasbon/PyVCF
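For a hand-written alternative to a library, here is a minimal sketch in plain Python of one possible way to do "the rest": read the population-level AFR_AF value from the INFO column and count alternate alleles per sample from the genotype columns. The function names (parse_info, alt_allele_count, summarize_record) and the wanted set of sample IDs are hypothetical and chosen only for illustration. Which samples actually belong to AFR has to come from a sample-to-population mapping that is not shown in this thread.

```python
def parse_info(info_field):
    """Turn 'AC=1;AF=0.0002;...' into a dict; keys without '=' map to True."""
    info = {}
    for item in info_field.split(';'):
        key, sep, value = item.partition('=')
        info[key] = value if sep else True
    return info


def alt_allele_count(genotype):
    """Count non-reference alleles in a genotype such as '0|1' or '1/1'."""
    gt = genotype.split(':')[0]               # GT is the first FORMAT subfield
    alleles = gt.replace('|', '/').split('/')
    return sum(1 for a in alleles if a not in ('0', '.'))


def summarize_record(line, sample_names, wanted_samples):
    """Return (AFR_AF string or None, {sample: alt allele count})."""
    fields = line.rstrip('\n').split('\t')
    info = parse_info(fields[7])              # column 8 (index 7) is INFO
    afr_af = info.get('AFR_AF')               # kept as a string here
    per_sample = {}
    for name, gt in zip(sample_names, fields[9:]):
        if name in wanted_samples:
            per_sample[name] = alt_allele_count(gt)
    return afr_af, per_sample


# Toy input based on the example lines in the question (tab-separated).
header = '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tHG00096\tHG00097\tHG00099'
record = ('20\t60343\t.\tG\tA\t100\tPASS\t'
          'AC=1;AF=0.000199681;AN=5008;NS=2504;DP=20377;EAS_AF=0;'
          'AMR_AF=0.0014;AFR_AF=0;EUR_AF=0;SAS_AF=0;AA=.|||\t'
          'GT\t0|0\t0|0\t0|0')

sample_names = header.split('\t')[9:]
wanted = {'HG00096', 'HG00099'}               # hypothetical selection, not real AFR membership

afr_af, counts = summarize_record(record, sample_names, wanted)
print(afr_af, counts)
```

In a real script the same functions would be called inside the file-reading loop: take sample_names from the '#CHROM' header line, then call summarize_record on every following data line. Note that AFR_AF may hold several comma-separated values at multi-allelic sites, which this sketch does not split.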