Question

Filter vcf file based on the GTs of a single sample

0

Entering edit mode

3.3 years ago

miguellarrazlopezdenovales ▴ 20

Hi, I have a multisample vcf, and I am trying to keep only those variants that are biallelic and homozygous in one of the samples.How can I achieve this and also keep the GTs for the other samples at the position in which this condition is true.

bcftools vcf samtools bcf • 918 views

ADD COMMENT • link updated 3.3 years ago by sbstevenlee ▴ 480 • written 3.3 years ago by miguellarrazlopezdenovales ▴ 20

score 1 · Answer 1 · 2021-07-26

1

Entering edit mode

3.3 years ago

Pierre Lindenbaum 164k

use GATK SelectVariants https://gatk.broadinstitute.org/hc/en-us/articles/360037055952-SelectVariants with a JEXL expression like

  -select 'vc.getGenotype("NA12878").isHomVar()'

ADD COMMENT • link 3.3 years ago by Pierre Lindenbaum 164k

score 1 · Answer 2 · 2021-07-26

Here's a Python API solution using the pyvcf submodule I wrote:

>>> from fuc import pyvcf
>>> data = {
...     'CHROM': ['chr1', 'chr1', 'chr1', 'chr1', 'chr1'],
...     'POS': [100, 101, 102, 103, 104],
...     'ID': ['.', '.', '.', '.', '.'],
...     'REF': ['A', 'A', 'C', 'C', 'T'],
...     'ALT': ['C,T', 'T', 'G', 'G,A', 'A'],
...     'QUAL': ['.', '.', '.', '.', '.'],
...     'FILTER': ['.', '.', '.', '.', '.'],
...     'INFO': ['.', '.', '.', '.', '.'],
...     'FORMAT': ['GT', 'GT', 'GT', 'GT', 'GT'],
...     'A': ['0/2', '0/0', '0/1', './.', '0/1'],
...     'B': ['0/1', '1/1', './.', '1/2', '1/1'],
... }
>>> vf = pyvcf.VcfFrame.from_dict([], data)
>>> # vf = pyvcf.VcfFrame.from_file('in.vcf')
>>> vf.df
  CHROM  POS ID REF  ALT QUAL FILTER INFO FORMAT    A    B
0  chr1  100  .   A  C,T    .      .    .     GT  0/2  0/1
1  chr1  101  .   A    T    .      .    .     GT  0/0  1/1
2  chr1  102  .   C    G    .      .    .     GT  0/1  ./.
3  chr1  103  .   C  G,A    .      .    .     GT  ./.  1/2
4  chr1  104  .   T    A    .      .    .     GT  0/1  1/1
>>> # Remove multiallelic variants
>>> vf = vf.filter_multialt()
>>> vf.df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT    A    B
0  chr1  101  .   A   T    .      .    .     GT  0/0  1/1
1  chr1  102  .   C   G    .      .    .     GT  0/1  ./.
2  chr1  104  .   T   A    .      .    .     GT  0/1  1/1
>>> # Select only variants where the sample B is homozygous
>>> def one_row(r):
...     return r['B'].split('/')[0] == r['B'].split('/')[1]
... 
>>> i = vf.df.apply(one_row, axis=1)
>>> vf.df = vf.df[i]
>>> vf.df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT    A    B
0  chr1  101  .   A   T    .      .    .     GT  0/0  1/1
1  chr1  102  .   C   G    .      .    .     GT  0/1  ./.
2  chr1  104  .   T   A    .      .    .     GT  0/1  1/1
>>> vf.to_file('out.vcf')