Question

Parse EggNOG-mapper annotation output using Python

2.4 years ago

ben@f ▴ 20

Hello Biostars, I used to use Python but I'm not good at it, so I need your help. I'm trying to parse a tab-delimited file produced by functional annotation with EggNOG-mapper (here).

I want to explore all the essential information and make some plots from this data.

What I have done so far is read the file as a DataFrame:

import pandas as pd

# add column headers, since they are passed explicitly via names=
data = pd.read_csv("test.emapper.annotations.tab", delimiter="\t",
                   names=["query", "seed_ortholog", "evalue", "score",
                          "eggNOG_OGs", "max_annot_lvl", "COG_category", "Description",
                          "Preferred_name", "GOs", "EC", "KEGG_ko", "KEGG_Pathway", "KEGG_Module",
                          "KEGG_Reaction", "KEGG_rclass", "BRITE", "KEGG_TC", "CAZy", "BiGG_Reaction", "PFAMs"])


data = data.rename(columns={"Preferred_name": "Gene"})
df = data[["seed_ortholog", "evalue", "Description", "Gene", "GOs"]]
df = df.sort_values("evalue", ascending=False).reset_index(drop=True)
df
    seed_ortholog   evalue  Description Gene    GOs
0   9031.ENSGALP00000019831 0.000991    C-mannosyltransferase DPY19L1   DPY19L1 GO:0000030,GO:0003674,GO:0003824,GO:0005575,GO...
1   8081.XP_008421935.1 0.000975    Ring finger protein RNF165  GO:0000209,GO:0000902,GO:0000904,GO:0003674,GO...
2   118797.XP_007471345.1   0.000964    Neural Wiskott-Aldrich syndrome protein WASL    GO:0000003,GO:0000139,GO:0000226,GO:0000768,GO...
3   31234.CRE04397  0.000905    Belongs to the DHHC palmitoyltransferase family ZDHHC13 GO:0000003,GO:0000139,GO:0001505,GO:0002791,GO...
4   38654.XP_006036177.1    0.000870    protein C6orf47 homolog C6orf47 GO:0005575,GO:0005622,GO:0005623,GO:0005737,GO...

First, I need help with the "seed_ortholog" column. I want to split it into two columns ("taxid", "term_name"). As you can see from the example, each string is composed of two parts: the prefix is the taxonomy ID and the rest is the annotation ID, most of them Ensembl IDs or NCBI gene IDs.

I tried:

data["seed_ortholog"].str.split(".", n = 1, expand = False)

Output:

>     0        [9031, ENSGALP00000019831]
>     1            [8081, XP_008421935.1]
>     2          [118797, XP_007471345.1]
>     3                 [31234, CRE04397]
>     4           [38654, XP_006036177.1]
>                         ...            
>     11566       [52644, XP_010569810.1]
>     11567       [52644, XP_010579726.1]
>     11568                           NaN
>     11569                           NaN
>     11570                           NaN
>     Name: seed_ortholog, Length: 11571, dtype: object

Any help with this small issue would be appreciated. Thank you.

EggNOG-mapper Python • 1.7k views

updated 2.4 years ago by Mensur Dlakic ★ 29k • written 2.4 years ago by ben@f ▴ 20

With what exactly do you need help? The output of your script seems to be as expected. It just so happens that several rows at the end have NaN values in the seed_ortholog column. This is not because of your script.
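
If the goal is two separate columns rather than one column of lists, one option is `expand=True` instead of `expand=False`. Below is a minimal sketch on a toy frame built from values shown in the question (the real file is not reproduced here); `taxid` and `term_name` are the column names the question asks for, and `parts` is just an illustrative variable name.

```python
import pandas as pd

# Toy stand-in for the real emapper table; values copied from the question,
# plus one missing entry to mimic the NaN rows at the end of the file.
data = pd.DataFrame({
    "seed_ortholog": [
        "9031.ENSGALP00000019831",
        "8081.XP_008421935.1",
        "31234.CRE04397",
        None,
    ]
})

# expand=True returns a DataFrame (one column per piece) instead of a
# Series of lists. n=1 splits on the first dot only, so an accession such
# as "XP_008421935.1" keeps its version suffix.
parts = data["seed_ortholog"].str.split(".", n=1, expand=True)
parts.columns = ["taxid", "term_name"]

# Attach the two new columns; rows are aligned on the shared index.
data = data.join(parts)
print(data)
```

Rows with a missing `seed_ortholog` stay missing in both new columns; `data.dropna(subset=["seed_ortholog"])` before the split would drop them. If those trailing rows turn out to come from `#`-prefixed comment lines in the file (an assumption, worth checking by looking at the last lines of the file), passing `comment="#"` to `pd.read_csv` should skip them at read time.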

2.4 years ago by Mensur Dlakic ★ 29k