Question

vcf file per-sample comparison in a region

0

Entering edit mode

3.2 years ago

C4 ▴ 30

Hi,

I have a multi-sample vcf file. I would like to check variants in a specific region ChrX:100000-1000200 (for example) present per-sample basis. How can I do this?

Also, if I had to calculate variant frequency per group (all the samples in vcf can be categorized into two groups, for example - male vs female) - what is the recommended way of doing this?

Thank you in advance.

SNP next-gen • 857 views

ADD COMMENT • link updated 3.2 years ago by sbstevenlee ▴ 480 • written 3.2 years ago by C4 ▴ 30

score 0 · Answer 1 · 2022-02-14

Below are Python solutions using the pyvcf submodule from the fuc package I wrote.

Assume you have the following data:

>>> from fuc import pyvcf
>>> data = {
...     'CHROM': ['chr1', 'chr1', 'chr2'],
...     'POS': [100, 101, 500],
...     'ID': ['.', '.', '.'],
...     'REF': ['G', 'T', 'A'],
...     'ALT': ['A', 'C', 'T'],
...     'QUAL': ['.', '.', '.'],
...     'FILTER': ['.', '.', '.'],
...     'INFO': ['.', '.', '.'],
...     'FORMAT': ['GT', 'GT', 'GT'],
...     'Male1': ['0/1', '0/0', '0/0'],
...     'Male2': ['1/1', '0/1', '0/0'],
...     'Female1': ['0/0', '0/0', '0/0'],
...     'Female2': ['0/0', '0/0', '0/1']
... }
>>> vf = pyvcf.VcfFrame.from_dict([], data)
>>> vf.df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT Male1 Male2 Female1 Female2
0  chr1  100  .   G   A    .      .    .     GT   0/1   1/1     0/0     0/0
1  chr1  101  .   T   C    .      .    .     GT   0/0   0/1     0/0     0/0
2  chr2  500  .   A   T    .      .    .     GT   0/0   0/0     0/0     0/1

You can 'slice' your VCF for given region(s):

>>> vf = vf.slice('chr1:100-200')
>>> vf.df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT Male1 Male2 Female1 Female2
0  chr1  100  .   G   A    .      .    .     GT   0/1   1/1     0/0     0/0
1  chr1  101  .   T   C    .      .    .     GT   0/0   0/1     0/0     0/0

In your case, you would probably want to simultaneously import and slice the VCF:

# vf = pyvcf.VcfFrame.from_file('in.vcf', regions='chr1:100-200')

As for getting allele frequency of a variant for different sample groups:

>>> import pandas as pd
>>> 
>>> male_vf = vf.subset(['Male1', 'Male2'])
>>> female_vf = vf.subset(['Female1', 'Female2'])
>>> 
>>> def one_gt(g):
...     return sum([int(x) for x in g.split('/')])
... 
>>> def one_row(r):
...     variant = f'{r.CHROM}-{r.POS}-{r.REF}-{r.ALT}'
...     frequency = sum(r[9:].apply(one_gt)) / (len(r[9:]) * 2)
...     return pd.Series([variant, frequency])
... 
>>> male_df = male_vf.df.apply(one_row, axis=1)
>>> female_df = female_vf.df.apply(one_row, axis=1)
>>> 
>>> male_df.columns = ['Variant', 'Male']
>>> female_df.columns = ['Variant', 'Female']
>>> 
>>> male_df.merge(female_df, on='Variant')
        Variant  Male  Female
0  chr1-100-G-A  0.75     0.0
1  chr1-101-T-C  0.25     0.0

Let me know if you need further help.