Question

012 matrix variation for multiple alleles

0

Entering edit mode

3.5 years ago

shahar • 0

Hi,

I am using VcfTools to parser VCF files.

I can use it to generate a 012 matrix. This matrix is 2D, with the shape of (num of individuals, num of SNPs).

In each cell in the matrix, there is the number of occurrences, of the alternative allele for the specific SNP in a specific individual.

This is great for a biallelic data - means for every SNP there is a single allele.

In my case there are at most n alleles per SNP,and I would like to have n matrices, each one is for an "allele index" and it specify in each cell how many occurrences of that allele are, in the specific Individual in the specific SNP.

Does anyone familiar with a tool that can provide that?

Thanks

vcftools vcf • 2.7k views

ADD COMMENT • link updated 2.4 years ago by sbstevenlee ▴ 480 • written 3.5 years ago by shahar • 0

score 2 · Accepted Answer · 2021-12-26

2

Entering edit mode

3.5 years ago

sbstevenlee ▴ 480

If you are a Python user, below is a Python API solution using the pyvcf submodule from the fuc package I wrote:

Assume you have the following data:

>>> import pandas as pd
>>> from fuc import pyvcf
>>> data = {
...     'CHROM': ['chr1', 'chr1'],
...     'POS': [100, 101],
...     'ID': ['.', '.'],
...     'REF': ['G', 'T'],
...     'ALT': ['A', 'C,A'],
...     'QUAL': ['.', '.'],
...     'FILTER': ['.', '.'],
...     'INFO': ['.', '.'],
...     'FORMAT': ['GT', 'GT'],
...     'A': ['0/1', '0/1'],
...     'B': ['0/0', '0/2'],
...     'C': ['1/1', '1/2'],
... }
>>> vf = pyvcf.VcfFrame.from_dict([], data)
>>> vf.df
  CHROM  POS ID REF  ALT QUAL FILTER INFO FORMAT    A    B    C
0  chr1  100  .   G    A    .      .    .     GT  0/1  0/0  1/1
1  chr1  101  .   T  C,A    .      .    .     GT  0/1  0/2  1/2

We can compute the desired matrix with:

>>> variants = {}
>>> for i, r in vf.df.iterrows():
...     variant = f'{r.CHROM}-{r.POS}-{r.REF}-{r.REF}'
...     variants[variant] = r[9:].apply(lambda x: x.split('/').count(str(0)))
...     for j, allele in enumerate(r.ALT.split(',')):
...         variant = f'{r.CHROM}-{r.POS}-{r.REF}-{allele}'
...         variants[variant] = r[9:].apply(lambda x: x.split('/').count(str(j+1)))
... 
>>> df = pd.DataFrame(variants).T
>>> df
              A  B  C
chr1-100-G-G  1  2  0
chr1-100-G-A  1  0  2
chr1-101-T-T  1  1  0
chr1-101-T-C  1  0  1
chr1-101-T-A  0  1  1

You can easily read and write VCF files:

vf = pyvcf.VcfFrame.from_file('in.vcf')
vf.to_file('out.vcf')

ADD COMMENT • link 3.5 years ago by sbstevenlee ▴ 480

0

Entering edit mode

I want to ask: what mean 1? 0 and 2 in the last df ? also gt how changed after split alt for example 1/2 how will be changed after split alt?

ADD REPLY • link 2.5 years ago by arwa.ahmad95 • 0

0

Entering edit mode

The numbers in the last df represent allele count. For instance, since the A sample is heterozygous for chr1-100-G-A the sample has 1 and so on.

I'm not sure if I understand your second question. If you are asking about multiallelic loci, they will be split into multiple rows. For instance, the position chr1-101 has two alternative alleles but in the last df they were split into three rows (two alt + one ref).

ADD REPLY • link 2.5 years ago by sbstevenlee ▴ 480

0

Entering edit mode

the last df represent allele count for reference? the second question yes I know that split the multiallelic but 1/2 is changed to what? 0/1 then it's count the allel? .. Also, I want to ask what is mean the GG or TT in last df? also why not split only multiallelic and keep bi allel as it self?

ADD REPLY • link 2.5 years ago by arwa.ahmad95 • 0

0

Entering edit mode

the last df represent allele count for reference?

No, they represent the counts for both reference and alternative alleles. Continuing to use the A sample as an example, it is heterozygous for chr1-100-G-A. Therefore, it has an count of 1 for chr1-100-G-A (alt) and 1 for chr1-100-G-G (ref).

I know that split the multiallelic but 1/2 is changed to what? 0/1 then it's count the allel?

Yes.

also why not split only multiallelic and keep bi allel as it self?

That depends on what your end goal is. When I first developed this function, I wanted to quickly get allele counts for both alt and ref alleles. You can easily modify the above code to extract allele counts only for alt if that's what you want.

ADD REPLY • link 2.5 years ago by sbstevenlee ▴ 480

0

Entering edit mode

Also , I want to ask the fuc package work on windows and python? I can't download it.

ADD REPLY • link 2.5 years ago by arwa.ahmad95 • 0

0

Entering edit mode

It should work in Windows too. It's just a Python library. If you can use popular libraries like panda and numpy, you should be able to use fuc as well. I recommend using conda to install fuc.

ADD REPLY • link 2.5 years ago by sbstevenlee ▴ 480

0

Entering edit mode

I'm using coda but the problem I think that need (pysam). And I can't download it until now.

ADD REPLY • link 2.4 years ago by arwa.ahmad95 • 0

0

Entering edit mode

Try this:

$ conda create -n fuc -c bioconda fuc=0.36.0 pysam=0.20.0
$ conda activate fuc
$ fuc -h

ADD REPLY • link 2.4 years ago by sbstevenlee ▴ 480