Parsing EggNOG-mapper annotation output using Python
20 months ago
ben@f ▴ 20

Hello Biostars, I have used Python before, but I'm not good at it, so I need your help. I'm trying to parse a tab-delimited file produced by functional annotation with EggNOG-mapper.

I want to explore all the essential information and make some plots from this data.

What I have done so far is read the file into a DataFrame:

import pandas as pd

Adding headers:

data = pd.read_csv(
    "test.emapper.annotations.tab",
    delimiter="\t",
    names=["query", "seed_ortholog", "evalue", "score",
           "eggNOG_OGs", "max_annot_lvl", "COG_category", "Description",
           "Preferred_name", "GOs", "EC", "KEGG_ko", "KEGG_Pathway", "KEGG_Module",
           "KEGG_Reaction", "KEGG_rclass", "BRITE", "KEGG_TC", "CAZy", "BiGG_Reaction", "PFAMs"],
)


data = data.rename(columns={"Preferred_name": "Gene"})
df = data[["seed_ortholog", "evalue", "Description", "Gene", "GOs"]]
df = df.sort_values("evalue", ascending=False).reset_index(drop=True)
df
   seed_ortholog            evalue    Description                                      Gene     GOs
0  9031.ENSGALP00000019831  0.000991  C-mannosyltransferase DPY19L1                    DPY19L1  GO:0000030,GO:0003674,GO:0003824,GO:0005575,GO...
1  8081.XP_008421935.1      0.000975  Ring finger protein                              RNF165   GO:0000209,GO:0000902,GO:0000904,GO:0003674,GO...
2  118797.XP_007471345.1    0.000964  Neural Wiskott-Aldrich syndrome protein          WASL     GO:0000003,GO:0000139,GO:0000226,GO:0000768,GO...
3  31234.CRE04397           0.000905  Belongs to the DHHC palmitoyltransferase family  ZDHHC13  GO:0000003,GO:0000139,GO:0001505,GO:0002791,GO...
4  38654.XP_006036177.1     0.000870  protein C6orf47 homolog                          C6orf47  GO:0005575,GO:0005622,GO:0005623,GO:0005737,GO...

First, I need help with the "seed_ortholog" column. I want to split it into two columns ("taxid", "term_name"). As you can see, each string is composed of two parts: the prefix is the taxonomy ID, and the rest is the annotation ID, most of which are Ensembl IDs or NCBI gene IDs.

I tried to do

data["seed_ortholog"].str.split(".", n=1, expand=False)

Output:

>     0        [9031, ENSGALP00000019831]
>     1            [8081, XP_008421935.1]
>     2          [118797, XP_007471345.1]
>     3                 [31234, CRE04397]
>     4           [38654, XP_006036177.1]
>                         ...            
>     11566       [52644, XP_010569810.1]
>     11567       [52644, XP_010579726.1]
>     11568                           NaN
>     11569                           NaN
>     11570                           NaN
>     Name: seed_ortholog, Length: 11571, dtype: object

Any help with this small issue would be appreciated. Thank you.

EggNOG-mapper Python • 1.2k views

What exactly do you need help with? The output of your script seems to be as expected. It just so happens that several rows at the end have NaN values in the seed_ortholog column. This is not caused by your script.
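
If the goal is two separate columns rather than a column of lists, one possible approach is to pass expand=True to str.split and assign the result to two new columns. The snippet below is only a minimal sketch: it uses a small toy DataFrame built from the values shown in the question (with one missing entry standing in for the trailing NaN rows), and dropping the NaN rows first is an assumption about what you want, not a requirement. The column names "taxid" and "term_name" are the ones proposed in the question.

```python
import pandas as pd

# Toy stand-in for the real table: values copied from the question,
# plus one missing entry to mimic the trailing NaN rows.
data = pd.DataFrame({
    "seed_ortholog": [
        "9031.ENSGALP00000019831",
        "8081.XP_008421935.1",
        "31234.CRE04397",
        None,
    ]
})

# Assumption: rows without a seed ortholog are not needed, so drop them.
data = data.dropna(subset=["seed_ortholog"]).reset_index(drop=True)

# n=1 splits only on the first dot, so IDs such as "XP_008421935.1" stay intact.
# expand=True returns a DataFrame with one column per piece instead of lists.
data[["taxid", "term_name"]] = data["seed_ortholog"].str.split(".", n=1, expand=True)

print(data)
```

Note that taxid will be a string column here; convert it with astype(int) if you need numbers. If the NaN rows come from "#"-prefixed summary lines at the end of the emapper file (an assumption, so check the tail of your file), passing comment="#" to pd.read_csv might skip them at read time instead.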
