If you are a Python user, here is a solution that uses the Python API of the pyvcf submodule from the fuc package, which I wrote.
Assume you have the following data:
>>> from fuc import pyvcf
>>> data = {
... 'CHROM': ['chr1', 'chr1', 'chr1', 'chr1'],
... 'POS': [100, 101, 102, 103],
... 'ID': ['.', '.', '.', '.'],
... 'REF': ['G', 'T', 'T', 'C'],
... 'ALT': ['A', 'C', 'A', 'A'],
... 'QUAL': ['.', '.', '.', '.'],
... 'FILTER': ['.', '.', '.', '.'],
... 'INFO': ['.', '.', '.', '.'],
... 'FORMAT': ['GT', 'GT', 'GT', 'GT'],
... 'Sample1': ['0/1', '0/0', '0/0', '0/1'],
... 'Sample2': ['1/1', '0/0', '0/1', '0/1'],
... 'Sample3': ['0/1', '0/0', '0/0', '0/1'],
... 'Sample4': ['0/0', '0/0', '0/0', '0/1'],
... 'Sample5': ['0/0', '0/1', '0/1', '0/0'],
... 'Sample6': ['0/0', '0/0', '0/0', '0/0'],
... 'Sample7': ['0/0', '0/0', '1/1', '0/0'],
... 'Sample8': ['0/0', '0/0', '0/1', '0/0'],
... 'Sample9': ['0/0', '0/0', '0/1', '0/0'],
... }
>>> vf = pyvcf.VcfFrame.from_dict([], data)
>>> vf.df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT Sample1 Sample2 Sample3 Sample4 Sample5 Sample6 Sample7 Sample8 Sample9
0  chr1  100  .   G   A    .      .    .     GT     0/1     1/1     0/1     0/0     0/0     0/0     0/0     0/0     0/0
1  chr1  101  .   T   C    .      .    .     GT     0/0     0/0     0/0     0/0     0/1     0/0     0/0     0/0     0/0
2  chr1  102  .   T   A    .      .    .     GT     0/0     0/1     0/0     0/0     0/1     0/0     1/1     0/1     0/1
3  chr1  103  .   C   A    .      .    .     GT     0/1     0/1     0/1     0/1     0/0     0/0     0/0     0/0     0/0
If you want to extract variants that are found in all of Sample1, Sample2, and Sample3 and in none of the other samples, you can do the following:
>>> # First select variants that appear in all three of the samples
>>> vf = vf.filter_sampall(samples=['Sample1', 'Sample2', 'Sample3'])
>>> vf.df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT Sample1 Sample2 Sample3 Sample4 Sample5 Sample6 Sample7 Sample8 Sample9
0  chr1  100  .   G   A    .      .    .     GT     0/1     1/1     0/1     0/0     0/0     0/0     0/0     0/0     0/0
1  chr1  103  .   C   A    .      .    .     GT     0/1     0/1     0/1     0/1     0/0     0/0     0/0     0/0     0/0
>>> # Then remove any variants that are not unique to the three samples
>>> vf = vf.filter_sampany(samples=['Sample4', 'Sample5', 'Sample6', 'Sample7', 'Sample8', 'Sample9'], opposite=True)
>>> vf.df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT Sample1 Sample2 Sample3 Sample4 Sample5 Sample6 Sample7 Sample8 Sample9
0  chr1  100  .   G   A    .      .    .     GT     0/1     1/1     0/1     0/0     0/0     0/0     0/0     0/0     0/0
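If it helps to see the logic of the two filters spelled out, here is a rough pure-Python sketch of the same selection on the genotype columns alone. The helper names `has_variant` and `private_shared_rows` are hypothetical and are not part of fuc. The sketch assumes a sample "has" a variant when its GT string contains any allele other than `0` or the missing marker `.`, so fuc itself may treat edge cases differently.

```python
def has_variant(genotype):
    """Return True if a GT string such as '0/1' or '1|1' carries a non-reference allele.

    Assumption for this sketch: '0' and '.' are the only alleles that do not count.
    """
    alleles = genotype.replace('|', '/').split('/')
    return any(allele not in ('0', '.') for allele in alleles)


def private_shared_rows(genotypes, targets):
    """Return row indices where every target sample has the variant and no other sample does."""
    others = [sample for sample in genotypes if sample not in targets]
    n_rows = len(next(iter(genotypes.values())))
    keep = []
    for i in range(n_rows):
        # Step 1: the variant must be present in all target samples.
        in_all_targets = all(has_variant(genotypes[s][i]) for s in targets)
        # Step 2: the variant must be absent from every remaining sample.
        in_any_other = any(has_variant(genotypes[s][i]) for s in others)
        if in_all_targets and not in_any_other:
            keep.append(i)
    return keep


# Same genotype columns as in the example data above.
genotypes = {
    'Sample1': ['0/1', '0/0', '0/0', '0/1'],
    'Sample2': ['1/1', '0/0', '0/1', '0/1'],
    'Sample3': ['0/1', '0/0', '0/0', '0/1'],
    'Sample4': ['0/0', '0/0', '0/0', '0/1'],
    'Sample5': ['0/0', '0/1', '0/1', '0/0'],
    'Sample6': ['0/0', '0/0', '0/0', '0/0'],
    'Sample7': ['0/0', '0/0', '1/1', '0/0'],
    'Sample8': ['0/0', '0/0', '0/1', '0/0'],
    'Sample9': ['0/0', '0/0', '0/1', '0/0'],
}

rows = private_shared_rows(genotypes, ['Sample1', 'Sample2', 'Sample3'])
print(rows)  # [0], i.e. only the variant at POS 100 remains
```

The row at POS 103 drops out in the second step because Sample4 also carries it.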
You can easily read and write VCF files:
>>> vf = pyvcf.VcfFrame.from_file('in.vcf')  # read a VCF file into a VcfFrame
>>> vf.to_file('out.vcf')  # write the VcfFrame to a VCF file
Let me know if you have any questions.
See also these similar questions: Filtering of rare variants; Shared variants; ...