2.8 years ago
ManuelDB
I am working in the Genomics England Research Environment, where the number of available tools is very limited. I have imported a VCF into a pandas DataFrame, and now I need to extract the gene names and gene IDs from the 'INFO' column. I would like to have this information in new columns, as shown below.
The INFO column:
INFO
SVTYPE=CNV;END=401233
SVTYPE=CNV;END=401233;CSQT=1|gene_name|ESNT12345|
SVTYPE=CNV;END=401233;CSQT=1|gene_name|ESNT12345|,1|gene_name|ESNT12345|
SVTYPE=CNV;END=401233;CSQT=1|gene_name|ESNT12345|,1|gene_name|ESNT12345|,1|gene_name|ESNT12345|...
The output I would like to have (this is just one possibility):
gene1 gene2 gene3
Na Na Na
gene_name|ESNT12345 Na Na
gene_name|ESNT12345 gene_name|ESNT12345 Na
gene_name|ESNT12345 gene_name|ESNT12345 gene_name|ESNT12345
Any suggestions for doing this with pandas, or with another Python module that is hopefully available in the environment?
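A minimal sketch of one way to do this with pandas alone; it is not necessarily what any reply in this thread used. It assumes each CSQT entry has the form index|gene_name|gene_ID| and that entries are comma-separated, as in the example rows. The function name extract_genes and the variable names are made up for illustration.

```python
import pandas as pd

def extract_genes(info):
    """Return a list of 'gene_name|gene_id' strings found in one INFO value."""
    # INFO is a semicolon-separated list of KEY=VALUE tags.
    for field in info.split(";"):
        if field.startswith("CSQT="):
            genes = []
            # Each comma-separated entry is assumed to look like "1|gene_name|ESNT12345|".
            for entry in field[len("CSQT="):].split(","):
                parts = entry.split("|")
                if len(parts) >= 3:
                    genes.append(parts[1] + "|" + parts[2])
            return genes
    # No CSQT tag: no genes for this record.
    return []

# Toy data copied from the example rows in the question.
df = pd.DataFrame({"INFO": [
    "SVTYPE=CNV;END=401233",
    "SVTYPE=CNV;END=401233;CSQT=1|gene_name|ESNT12345|",
    "SVTYPE=CNV;END=401233;CSQT=1|gene_name|ESNT12345|,1|gene_name|ESNT12345|",
]})

# One list per row; shorter lists are padded with missing values.
gene_lists = df["INFO"].apply(extract_genes)
genes = pd.DataFrame(gene_lists.tolist(), index=df.index)
genes.columns = [f"gene{i + 1}" for i in range(genes.shape[1])]
print(genes)
```

The number of gene columns is set by the row with the most CSQT entries, and rows with fewer entries get missing values in the remaining columns.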
Excellent, thanks. I have tried it and it works. I have seen the list and the pandas DataFrame generated with "genes". What I don't understand is how to add these to their respective columns, because I don't see how the genes relate to the number of [] generated.
Not sure I understood the question. In the code above, the columns of the output DF correspond to the gene1, gene2, gene3 columns of your example output. The condition
corresponds to the case where the CSQT tag is not present in the INFO field. Since there is no gene name associated with the record, we just create an empty row.
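On attaching the gene columns to the original table: if the gene DataFrame has one row per VCF record and shares its index, the two can be joined row by row. A small sketch under that assumption; the names vcf_df and genes and the POS column are hypothetical, and the toy values only mimic the example in the question.

```python
import pandas as pd

# Hypothetical VCF table with two records; the first has no CSQT tag.
vcf_df = pd.DataFrame({
    "POS": [100, 200],
    "INFO": [
        "SVTYPE=CNV;END=401233",
        "SVTYPE=CNV;END=401233;CSQT=1|gene_name|ESNT12345|",
    ],
})

# Hypothetical gene table: one row per VCF record, same index,
# with an empty (missing) row for the record without CSQT.
genes = pd.DataFrame(
    {"gene1": [None, "gene_name|ESNT12345"]},
    index=vcf_df.index,
)

# join() aligns on the index, so row i of genes lands next to row i of vcf_df.
result = vcf_df.join(genes)
print(result)
```

Because every record contributes exactly one row, including the empty ones, the row positions stay aligned and no key column is needed for the join.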