If you don't mind a Python solution, below is one using the pyvcf submodule from the fuc package I wrote.
Let's say your VCF contains samples from two populations A and B.
>>> from fuc import pyvcf
>>> data = {
... 'CHROM': ['chr1', 'chr1', 'chr1'],
... 'POS': [100, 101, 102],
... 'ID': ['.', '.', '.'],
... 'REF': ['G', 'T', 'T'],
... 'ALT': ['A', 'C', 'A'],
... 'QUAL': ['.', '.', '.'],
... 'FILTER': ['.', '.', '.'],
... 'INFO': ['.', '.', '.'],
... 'FORMAT': ['GT', 'GT', 'GT'],
... 'A1': ['0/1', '0/1', '0/1'],
... 'A2': ['0/0', '1/1', '0/0'],
... 'A3': ['0/1', '0/1', '0/0'],
... 'B1': ['0/0', '0/1', '1/1'],
... 'B2': ['0/0', '1/1', '0/1'],
... }
>>> vf = pyvcf.VcfFrame.from_dict([], data)
>>> # vf = pyvcf.VcfFrame.from_file('in.vcf')
>>> vf.df
CHROM POS ID REF ALT QUAL FILTER INFO FORMAT A1 A2 A3 B1 B2
0 chr1 100 . G A . . . GT 0/1 0/0 0/1 0/0 0/0
1 chr1 101 . T C . . . GT 0/1 1/1 0/1 0/1 1/1
2 chr1 102 . T A . . . GT 0/1 0/0 0/0 1/1 0/1
To identify variants with AF >= 0.2 in population A, first subset those samples and compute AF from their genotypes.
>>> A_vf = vf.subset(['A1', 'A2', 'A3'])
>>> A_vf = A_vf.compute_info('AF')
>>> A_vf.df
CHROM POS ID REF ALT QUAL FILTER INFO FORMAT A1 A2 A3
0 chr1 100 . G A . . AF=0.333 GT 0/1 0/0 0/1
1 chr1 101 . T C . . AF=0.667 GT 0/1 1/1 0/1
2 chr1 102 . T A . . AF=0.167 GT 0/1 0/0 0/0
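For readers who want to check the AF values above by hand, here is a minimal standard-library sketch of the arithmetic: count ALT alleles and divide by the number of called alleles. This is not fuc's implementation. The helper name `allele_frequency` is hypothetical, and the choices to skip missing alleles (`.`) and to count every non-reference allele as ALT are assumptions that only suit biallelic sites like the ones in this example.

```python
def allele_frequency(genotypes):
    """Return the ALT allele frequency for one biallelic site.

    genotypes: GT strings such as '0/1' or '1|1'.
    Missing alleles ('.') are skipped (an assumption, see the note above).
    """
    alt = 0
    total = 0
    for gt in genotypes:
        # Treat phased ('|') and unphased ('/') genotypes the same way.
        for allele in gt.replace('|', '/').split('/'):
            if allele == '.':
                continue  # missing call: excluded from numerator and denominator
            total += 1
            if allele != '0':
                alt += 1
    return alt / total if total else float('nan')


# Population A genotypes (A1, A2, A3) at the three example sites.
sites = {
    100: ['0/1', '0/0', '0/1'],
    101: ['0/1', '1/1', '0/1'],
    102: ['0/1', '0/0', '0/0'],
}
for pos, gts in sites.items():
    print(pos, round(allele_frequency(gts), 3))
# 100 0.333
# 101 0.667
# 102 0.167
```

With three diploid samples there are six alleles per site, so the site at position 102 has one ALT allele out of six and falls below the 0.2 cutoff.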
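Now apply the filter. The boolean mask is built from population A's AF but indexes the original vf.df, so the population B samples are kept in the result.
>>> i = A_vf.extract_info('#AF') >= 0.2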
>>> filtered_vf = pyvcf.VcfFrame(vf.copy_meta(), vf.df[i])
>>> filtered_vf.df
CHROM POS ID REF ALT QUAL FILTER INFO FORMAT A1 A2 A3 B1 B2
0 chr1 100 . G A . . . GT 0/1 0/0 0/1 0/0 0/0
1 chr1 101 . T C . . . GT 0/1 1/1 0/1 0/1 1/1
Optionally, write the filtered VCF to a file.
>>> # filtered_vf.to_file('out.vcf')
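If installing a package is not an option, the same filter can be sketched with the standard library alone by streaming over the VCF text. This is an alternative to the fuc approach, not a description of how fuc works internally. The function name `filter_vcf_lines` is hypothetical, and the sketch assumes tab-delimited records, biallelic sites, and a GT key in the FORMAT column.

```python
def filter_vcf_lines(lines, samples, min_af=0.2):
    """Yield header lines plus records whose ALT AF within `samples` is >= min_af."""
    sample_idx = []
    for line in lines:
        if line.startswith('##'):
            yield line  # meta-information lines pass through unchanged
            continue
        fields = line.rstrip('\n').split('\t')
        if line.startswith('#CHROM'):
            # Locate the columns of the population of interest.
            sample_idx = [fields.index(s) for s in samples]
            yield line
            continue
        gt_pos = fields[8].split(':').index('GT')  # FORMAT is the 9th column
        alt = total = 0
        for i in sample_idx:
            gt = fields[i].split(':')[gt_pos]
            for allele in gt.replace('|', '/').split('/'):
                if allele != '.':  # skip missing alleles (an assumption)
                    total += 1
                    alt += allele != '0'
        if total and alt / total >= min_af:
            yield line  # the full record is kept, including other samples


header = '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tA1\tA2\tA3\tB1\tB2\n'
records = [
    'chr1\t100\t.\tG\tA\t.\t.\t.\tGT\t0/1\t0/0\t0/1\t0/0\t0/0\n',
    'chr1\t101\t.\tT\tC\t.\t.\t.\tGT\t0/1\t1/1\t0/1\t0/1\t1/1\n',
    'chr1\t102\t.\tT\tA\t.\t.\t.\tGT\t0/1\t0/0\t0/0\t1/1\t0/1\n',
]
kept = list(filter_vcf_lines([header] + records, ['A1', 'A2', 'A3']))
kept_positions = [line.split('\t')[1] for line in kept[1:]]
print(kept_positions)  # ['100', '101']
```

Because the function accepts any iterable of lines, an open file handle for in.vcf can be passed in directly and the result written to out.vcf with `writelines`.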
Let me know if you have any questions.
Hi Raphael,
Thank you very much for this solution. I will test it and reply again if there are any issues.
Dean