vcf file per-sample comparison in a region
1
0
Entering edit mode
2.8 years ago
C4 ▴ 30

Hi,

I have a multi-sample vcf file. I would like to check variants in a specific region ChrX:100000-1000200 (for example) present per-sample basis. How can I do this?

Also, if I had to calculate variant frequency per group (all the samples in vcf can be categorized into two groups, for example - male vs female) - what is the recommended way of doing this?

Thank you in advance.

SNP next-gen • 763 views
ADD COMMENT
0
Entering edit mode
2.8 years ago
sbstevenlee ▴ 480

Below are Python solutions using the pyvcf submodule from the fuc package I wrote.

Assume you have the following data:

>>> from fuc import pyvcf
>>> data = {
...     'CHROM': ['chr1', 'chr1', 'chr2'],
...     'POS': [100, 101, 500],
...     'ID': ['.', '.', '.'],
...     'REF': ['G', 'T', 'A'],
...     'ALT': ['A', 'C', 'T'],
...     'QUAL': ['.', '.', '.'],
...     'FILTER': ['.', '.', '.'],
...     'INFO': ['.', '.', '.'],
...     'FORMAT': ['GT', 'GT', 'GT'],
...     'Male1': ['0/1', '0/0', '0/0'],
...     'Male2': ['1/1', '0/1', '0/0'],
...     'Female1': ['0/0', '0/0', '0/0'],
...     'Female2': ['0/0', '0/0', '0/1']
... }
>>> vf = pyvcf.VcfFrame.from_dict([], data)
>>> vf.df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT Male1 Male2 Female1 Female2
0  chr1  100  .   G   A    .      .    .     GT   0/1   1/1     0/0     0/0
1  chr1  101  .   T   C    .      .    .     GT   0/0   0/1     0/0     0/0
2  chr2  500  .   A   T    .      .    .     GT   0/0   0/0     0/0     0/1

You can 'slice' your VCF for given region(s):

>>> vf = vf.slice('chr1:100-200')
>>> vf.df
  CHROM  POS ID REF ALT QUAL FILTER INFO FORMAT Male1 Male2 Female1 Female2
0  chr1  100  .   G   A    .      .    .     GT   0/1   1/1     0/0     0/0
1  chr1  101  .   T   C    .      .    .     GT   0/0   0/1     0/0     0/0

In your case, you would probably want to simultaneously import and slice the VCF:

# vf = pyvcf.VcfFrame.from_file('in.vcf', regions='chr1:100-200')

As for getting allele frequency of a variant for different sample groups:

>>> import pandas as pd
>>> 
>>> male_vf = vf.subset(['Male1', 'Male2'])
>>> female_vf = vf.subset(['Female1', 'Female2'])
>>> 
>>> def one_gt(g):
...     return sum([int(x) for x in g.split('/')])
... 
>>> def one_row(r):
...     variant = f'{r.CHROM}-{r.POS}-{r.REF}-{r.ALT}'
...     frequency = sum(r[9:].apply(one_gt)) / (len(r[9:]) * 2)
...     return pd.Series([variant, frequency])
... 
>>> male_df = male_vf.df.apply(one_row, axis=1)
>>> female_df = female_vf.df.apply(one_row, axis=1)
>>> 
>>> male_df.columns = ['Variant', 'Male']
>>> female_df.columns = ['Variant', 'Female']
>>> 
>>> male_df.merge(female_df, on='Variant')
        Variant  Male  Female
0  chr1-100-G-A  0.75     0.0
1  chr1-101-T-C  0.25     0.0

Let me know if you need further help.

ADD COMMENT

Login before adding your answer.

Traffic: 1964 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6