Finding variants common to a subset of samples in a multi-sample VCF
2.9 years ago
dmp • 0

I have a multi-sample VCF file that I generated by following the GATK tutorial, using GenotypeGVCFs to jointly genotype my 9 samples. I would like to extract SNPs that are unique to 3 of the 9 samples, i.e. variants that are shared by those 3 samples and found in none of the other 6.

Does anyone have an example of how to do this? I've read through the documentation for SnpSift and for SelectVariants in GATK, but I can only see how to find SNPs unique to a single sample, not to a subset of samples. Any help would be greatly appreciated!

See these similar questions: Filtering of rare variants; Shared variants; ...

2.9 years ago
sbstevenlee ▴ 480

If you are a Python user, below is a solution using the Python API of the pyvcf submodule from the fuc package, which I wrote:

Assume you have the following data:

>>> from fuc import pyvcf
>>> data = {
...     'CHROM': ['chr1', 'chr1', 'chr1', 'chr1'],
...     'POS': [100, 101, 102, 103],
...     'ID': ['.', '.', '.', '.'],
...     'REF': ['G', 'T', 'T', 'C'],
...     'ALT': ['A', 'C', 'A', 'A'],
...     'QUAL': ['.', '.', '.', '.'],
...     'FILTER': ['.', '.', '.', '.'],
...     'INFO': ['.', '.', '.', '.'],
...     'FORMAT': ['GT', 'GT', 'GT', 'GT'],
...     'Sample1': ['0/1', '0/0', '0/0', '0/1'],
...     'Sample2': ['1/1', '0/0', '0/1', '0/1'],
...     'Sample3': ['0/1', '0/0', '0/0', '0/1'],
...     'Sample4': ['0/0', '0/0', '0/0', '0/1'],
...     'Sample5': ['0/0', '0/1', '0/1', '0/0'],
...     'Sample6': ['0/0', '0/0', '0/0', '0/0'],
...     'Sample7': ['0/0', '0/0', '1/1', '0/0'],
...     'Sample8': ['0/0', '0/0', '0/1', '0/0'],
...     'Sample9': ['0/0', '0/0', '0/1', '0/0'],
... }
>>> vf = pyvcf.VcfFrame.from_dict([], data)
>>> vf.df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT Sample1 Sample2 Sample3 Sample4 Sample5 Sample6 Sample7 Sample8 Sample9
0  chr1  100  .   G   A    .      .    .     GT     0/1     1/1     0/1     0/0     0/0     0/0     0/0     0/0     0/0
1  chr1  101  .   T   C    .      .    .     GT     0/0     0/0     0/0     0/0     0/1     0/0     0/0     0/0     0/0
2  chr1  102  .   T   A    .      .    .     GT     0/0     0/1     0/0     0/0     0/1     0/0     1/1     0/1     0/1
3  chr1  103  .   C   A    .      .    .     GT     0/1     0/1     0/1     0/1     0/0     0/0     0/0     0/0     0/0

If you want to extract variants that are found in all of Sample1, Sample2, and Sample3 and in none of the other samples, you can do the following:

>>> # First select variants that appear in all three of the samples
>>> vf = vf.filter_sampall(samples=['Sample1', 'Sample2', 'Sample3'])
>>> vf.df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT Sample1 Sample2 Sample3 Sample4 Sample5 Sample6 Sample7 Sample8 Sample9
0  chr1  100  .   G   A    .      .    .     GT     0/1     1/1     0/1     0/0     0/0     0/0     0/0     0/0     0/0
1  chr1  103  .   C   A    .      .    .     GT     0/1     0/1     0/1     0/1     0/0     0/0     0/0     0/0     0/0
>>> # Then drop any variant that also appears in at least one of the other six samples
>>> vf = vf.filter_sampany(samples=['Sample4', 'Sample5', 'Sample6', 'Sample7', 'Sample8', 'Sample9'], opposite=True)
>>> vf.df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT Sample1 Sample2 Sample3 Sample4 Sample5 Sample6 Sample7 Sample8 Sample9
0  chr1  100  .   G   A    .      .    .     GT     0/1     1/1     0/1     0/0     0/0     0/0     0/0     0/0     0/0
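
For anyone who wants to see the logic behind the two filtering steps spelled out, here is a minimal plain-Python sketch that does not use fuc at all. The helper names `carries_alt` and `is_private_to` are made up for this illustration and are not part of any package. It assumes GT is the first FORMAT field and treats missing calls (`./.`) as "variant not present"; that choice is an assumption you may want to change for your own data.

```python
def carries_alt(sample_field):
    """Return True if the genotype contains at least one non-reference allele.

    Assumes GT is the first colon-separated FORMAT field (e.g. '0/1:12,8:20').
    Missing alleles ('.') are treated as not carrying the variant (assumption).
    """
    gt = sample_field.split(':')[0]
    alleles = gt.replace('|', '/').split('/')
    return any(a not in ('0', '.') for a in alleles)


def is_private_to(genotypes, targets):
    """genotypes maps sample name -> genotype field for a single variant."""
    targets = set(targets)
    in_all_targets = all(carries_alt(genotypes[s]) for s in targets)
    in_any_other = any(
        carries_alt(gt) for s, gt in genotypes.items() if s not in targets
    )
    return in_all_targets and not in_any_other


# Same toy data as in the example above: position -> genotypes for Sample1..Sample9
samples = [f'Sample{i}' for i in range(1, 10)]
rows = {
    100: ['0/1', '1/1', '0/1', '0/0', '0/0', '0/0', '0/0', '0/0', '0/0'],
    101: ['0/0', '0/0', '0/0', '0/0', '0/1', '0/0', '0/0', '0/0', '0/0'],
    102: ['0/0', '0/1', '0/0', '0/0', '0/1', '0/0', '1/1', '0/1', '0/1'],
    103: ['0/1', '0/1', '0/1', '0/1', '0/0', '0/0', '0/0', '0/0', '0/0'],
}
targets = ['Sample1', 'Sample2', 'Sample3']

private_positions = [
    pos for pos, gts in rows.items()
    if is_private_to(dict(zip(samples, gts)), targets)
]
print(private_positions)  # [100]
```

Position 103 is dropped because Sample4 also carries the variant, which matches the result of the two-step filter above.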

You can easily read and write VCF files:

vf = pyvcf.VcfFrame.from_file('in.vcf')
vf.to_file('out.vcf')
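
With a real file, typing out every non-target sample name by hand is easy to get wrong. As a rough sketch using only the standard library, the list can be derived from the `#CHROM` header line of the VCF, where sample names start at the tenth tab-separated column. The function name `non_target_samples` is hypothetical and not part of fuc.

```python
def non_target_samples(header_line, targets):
    """Return the samples in a '#CHROM ...' header line that are not in targets."""
    columns = header_line.rstrip('\n').split('\t')
    samples = columns[9:]  # the first nine columns are CHROM..FORMAT
    unknown = [s for s in targets if s not in samples]
    if unknown:
        raise ValueError(f'Samples not found in VCF header: {unknown}')
    return [s for s in samples if s not in targets]


# Header line matching the toy example above
header = '\t'.join(
    ['#CHROM', 'POS', 'ID', 'REF', 'ALT', 'QUAL', 'FILTER', 'INFO', 'FORMAT']
    + [f'Sample{i}' for i in range(1, 10)]
)
others = non_target_samples(header, ['Sample1', 'Sample2', 'Sample3'])
print(others)
```

The resulting list can then be passed as the `samples` argument in the second filtering step instead of a hand-written list.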

Let me know if you have any questions.
