Below are Python solutions using the pyvcf
submodule from the fuc
package I wrote.
Assume you have the following data:
>>> from fuc import pyvcf
>>> data = {
... 'CHROM': ['chr1', 'chr1', 'chr2'],
... 'POS': [100, 101, 500],
... 'ID': ['.', '.', '.'],
... 'REF': ['G', 'T', 'A'],
... 'ALT': ['A', 'C', 'T'],
... 'QUAL': ['.', '.', '.'],
... 'FILTER': ['.', '.', '.'],
... 'INFO': ['.', '.', '.'],
... 'FORMAT': ['GT', 'GT', 'GT'],
... 'Male1': ['0/1', '0/0', '0/0'],
... 'Male2': ['1/1', '0/1', '0/0'],
... 'Female1': ['0/0', '0/0', '0/0'],
... 'Female2': ['0/0', '0/0', '0/1']
... }
>>> vf = pyvcf.VcfFrame.from_dict([], data)
>>> vf.df
CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Male1 Male2 Female1 Female2
0 chr1 100 . G A . . . GT 0/1 1/1 0/0 0/0
1 chr1 101 . T C . . . GT 0/0 0/1 0/0 0/0
2 chr2 500 . A T . . . GT 0/0 0/0 0/0 0/1
You can 'slice' your VCF for given region(s):
>>> vf = vf.slice('chr1:100-200')
>>> vf.df
CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Male1 Male2 Female1 Female2
0 chr1 100 . G A . . . GT 0/1 1/1 0/0 0/0
1 chr1 101 . T C . . . GT 0/0 0/1 0/0 0/0
In your case, you would probably want to simultaneously import and slice the VCF:
# vf = pyvcf.VcfFrame.from_file('in.vcf', regions='chr1:100-200')
As for getting allele frequency of a variant for different sample groups:
>>> import pandas as pd
>>>
>>> male_vf = vf.subset(['Male1', 'Male2'])
>>> female_vf = vf.subset(['Female1', 'Female2'])
>>>
>>> def one_gt(g):
... return sum([int(x) for x in g.split('/')])
...
>>> def one_row(r):
... variant = f'{r.CHROM}-{r.POS}-{r.REF}-{r.ALT}'
... frequency = sum(r[9:].apply(one_gt)) / (len(r[9:]) * 2)
... return pd.Series([variant, frequency])
...
>>> male_df = male_vf.df.apply(one_row, axis=1)
>>> female_df = female_vf.df.apply(one_row, axis=1)
>>>
>>> male_df.columns = ['Variant', 'Male']
>>> female_df.columns = ['Variant', 'Female']
>>>
>>> male_df.merge(female_df, on='Variant')
Variant Male Female
0 chr1-100-G-A 0.75 0.0
1 chr1-101-T-C 0.25 0.0
Let me know if you need further help.