Question

How to extract gene name and transcript name from the INFO column of a VCF file

2.8 years ago

ManuelDB ▴ 110

I am working in the Genomics England Research Environment, where the number of available tools is very limited. I have imported a VCF into a pandas DataFrame, and now I need to extract the gene names and gene IDs from the 'INFO' column. I would like to have this information in new columns, as shown below.

The info column

INFO
SVTYPE=CNV;END=401233
SVTYPE=CNV;END=401233;CSQT=1|gene_name|ESNT12345|
SVTYPE=CNV;END=401233;CSQT=1|gene_name|ESNT12345|,1|gene_name|ESNT12345|
SVTYPE=CNV;END=401233;CSQT=1|gene_name|ESNT12345|,1|gene_name|ESNT12345|,1|gene_name|ESNT12345|...

The output I would like to have (this is just one possibility)

gene1               gene2               gene3
Na                  Na                  Na
gene_name|ESNT12345 Na                  Na
gene_name|ESNT12345 gene_name|ESNT12345 Na
gene_name|ESNT12345 gene_name|ESNT12345 gene_name|ESNT12345

Any suggestions using pandas, or any other Python module that can do this (and that is hopefully available to me), would be appreciated.

pandas VCF • 1.3k views

updated 2.8 years ago by raphael.B ▴ 520 • written 2.8 years ago by ManuelDB ▴ 110

score 1 · Accepted Answer · 2022-02-15

2.8 years ago

raphael.B ▴ 520

Hi there! This should do the trick. You can then join the resulting DataFrame to your data.

import pandas as pd

# Example INFO values (one string per variant)
info = ["SVTYPE=CNV;END=401233",
        "SVTYPE=CNV;END=401233;CSQT=1|gene_name|ESNT12345|",
        "SVTYPE=CNV;END=401233;CSQT=1|gene_name|ESNT12345|,1|gene_name|ESNT12345|",
        "SVTYPE=CNV;END=401233;CSQT=1|gene_name|ESNT12345|,1|gene_name|ESNT12345|,1|gene_name|ESNT12345"]

genes = []
for variant in info:
    # search the semicolon-separated INFO tags for CSQT
    found_gene = False
    for tag in variant.split(';'):
        # match the tag name exactly, not any tag that merely contains "CSQT"
        if tag.startswith('CSQT='):
            # one list element per comma-separated CSQT entry
            genes.append(tag[len('CSQT='):].split(','))
            found_gene = True
            break
    # if not found, add an empty row so the rows stay aligned with the variants
    if not found_gene:
        genes.append([])

print(pd.DataFrame(genes))
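
If you want the cells to look like the gene_name|ESNT12345 values in your example output, each CSQT entry can be trimmed before building the DataFrame. Below is a minimal sketch. It assumes every entry has the layout index|gene|transcript| seen in your example (real CSQT annotations may carry more fields), and parse_csqt is just a helper name made up for this illustration.

```python
import pandas as pd


def parse_csqt(info_value):
    """Return 'gene|transcript' strings for one INFO value (hypothetical helper)."""
    for tag in info_value.split(";"):
        if tag.startswith("CSQT="):
            entries = tag[len("CSQT="):].split(",")
            # Assumed entry layout: index|gene|transcript| ; keep fields 2 and 3 only
            return ["|".join(entry.split("|")[1:3]) for entry in entries]
    # No CSQT tag in this record: empty row
    return []


info = [
    "SVTYPE=CNV;END=401233",
    "SVTYPE=CNV;END=401233;CSQT=1|gene_name|ESNT12345|",
    "SVTYPE=CNV;END=401233;CSQT=1|gene_name|ESNT12345|,1|gene_name|ESNT12345|",
]

# One row per variant; shorter rows are padded with missing values
genes = pd.DataFrame([parse_csqt(value) for value in info])
print(genes)
```

Rows without a CSQT tag come out as all-missing, which matches the Na rows in your example.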


Excellent, thanks. I have tried it and it works. I have looked at the list and the DataFrame generated from "genes". What I don't understand is how to add these values to their respective columns, because I don't see how the genes relate to the number of [] generated.

2.8 years ago by ManuelDB ▴ 110

Not sure I understood the question. In the code above, the columns of the output DF correspond to the gene1,gene2,gene3 columns of your example output.

The condition

if not found_gene:
    genes.append([])

corresponds to the case where the CSQT tag is not present in the INFO field. Since there is no gene name associated with the record, we just append an empty row. This keeps one row per variant, so row i of the output lines up with row i of your INFO column.
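
Because there is exactly one row per variant, the gene columns can be attached to the original data by index. Here is a rough sketch of that step. The small df is only a stand-in for your VCF DataFrame (the POS column and its values are made up), and the gene1, gene2, ... names follow your example output.

```python
import pandas as pd

# Hypothetical stand-in for the DataFrame loaded from the VCF
df = pd.DataFrame({
    "POS": [100, 200, 300],
    "INFO": [
        "SVTYPE=CNV;END=401233",
        "SVTYPE=CNV;END=401233;CSQT=1|gene_name|ESNT12345|",
        "SVTYPE=CNV;END=401233;CSQT=1|gene_name|ESNT12345|,1|gene_name|ESNT12345|",
    ],
})

genes = []
for variant in df["INFO"]:
    row = []  # stays empty when the record has no CSQT tag
    for tag in variant.split(";"):
        if tag.startswith("CSQT="):
            row = tag[len("CSQT="):].split(",")
    genes.append(row)

# Reuse df.index so that row i of gene_df lines up with row i of df
gene_df = pd.DataFrame(genes, index=df.index)
gene_df.columns = [f"gene{i + 1}" for i in range(gene_df.shape[1])]

# Side-by-side concatenation aligns on the shared index
result = pd.concat([df, gene_df], axis=1)
print(result)
```

The number of gene columns equals the largest number of CSQT entries found in any single record; shorter rows are padded with missing values.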

2.8 years ago by raphael.B ▴ 520