Hello Biostars, I have used Python before but I'm not good at it, so I need your help. I'm trying to parse a tab-delimited file produced by functional annotation with EggNOG-mapper.
I want to explore all the essential information and make some plots from this data.
What I have done so far is read the file into a DataFrame:
import pandas as pd
# adding headers
data = pd.read_csv(
    "test.emapper.annotations.tab",
    delimiter="\t",
    names=[
        "query", "seed_ortholog", "evalue", "score",
        "eggNOG_OGs", "max_annot_lvl", "COG_category", "Description",
        "Preferred_name", "GOs", "EC", "KEGG_ko", "KEGG_Pathway", "KEGG_Module",
        "KEGG_Reaction", "KEGG_rclass", "BRITE", "KEGG_TC", "CAZy", "BiGG_Reaction", "PFAMs",
    ],
)
data = data.rename(columns={"Preferred_name": "Gene"})
df = data[["seed_ortholog", "evalue", "Description", "Gene", "GOs"]]
df = df.sort_values("evalue", ascending=False).reset_index(drop=True)
df
   seed_ortholog            evalue    Description                                      Gene     GOs
0  9031.ENSGALP00000019831  0.000991  C-mannosyltransferase DPY19L1                    DPY19L1  GO:0000030,GO:0003674,GO:0003824,GO:0005575,GO...
1  8081.XP_008421935.1      0.000975  Ring finger protein                              RNF165   GO:0000209,GO:0000902,GO:0000904,GO:0003674,GO...
2  118797.XP_007471345.1    0.000964  Neural Wiskott-Aldrich syndrome protein          WASL     GO:0000003,GO:0000139,GO:0000226,GO:0000768,GO...
3  31234.CRE04397           0.000905  Belongs to the DHHC palmitoyltransferase family  ZDHHC13  GO:0000003,GO:0000139,GO:0001505,GO:0002791,GO...
4  38654.XP_006036177.1     0.000870  protein C6orf47 homolog                          C6orf47  GO:0005575,GO:0005622,GO:0005623,GO:0005737,GO...
First, I need help with the "seed_ortholog" column. I want to split it into two columns ("taxid", "term_name"). As you can see, each string is composed of two parts: the prefix is the taxonomy ID and the rest is the annotation ID, which in most cases is an Ensembl ID or an NCBI gene ID.
I tried:
data["seed_ortholog"].str.split(".", n=1, expand=False)
Output:
> 0 [9031, ENSGALP00000019831]
> 1 [8081, XP_008421935.1]
> 2 [118797, XP_007471345.1]
> 3 [31234, CRE04397]
> 4 [38654, XP_006036177.1]
> ...
> 11566 [52644, XP_010569810.1]
> 11567 [52644, XP_010579726.1]
> 11568 NaN
> 11569 NaN
> 11570 NaN
> Name: seed_ortholog, Length: 11571, dtype: object
I would appreciate any help with this small issue. Thank you.
With what exactly do you need help? The output of your script seems to be as expected. It just so happens that several rows at the end have NaN values in the seed_ortholog column. This is not because of your script.
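If the goal is two separate columns rather than a single column of lists, one option is to pass expand=True to the same str.split call and assign the two resulting columns back to the DataFrame. The following is only a minimal sketch: the small DataFrame is a stand-in built from the values shown in the question, with one missing value added to mimic the NaN rows, and dropping those rows at the end is an assumption about what is wanted, not a requirement.

```python
import pandas as pd

# Stand-in for the real annotations table. The IDs are copied from the
# question; the trailing None mimics the rows with a missing seed_ortholog.
data = pd.DataFrame({
    "seed_ortholog": [
        "9031.ENSGALP00000019831",
        "8081.XP_008421935.1",
        "31234.CRE04397",
        None,
    ]
})

# n=1 splits only at the first dot, so "XP_008421935.1" keeps its version suffix.
# expand=True returns a DataFrame with one column per piece instead of lists.
parts = data["seed_ortholog"].str.split(".", n=1, expand=True)
data["taxid"] = parts[0]
data["term_name"] = parts[1]

# Rows with a missing seed_ortholog stay missing in both new columns.
# Dropping them is optional (an assumption about the desired result).
clean = data.dropna(subset=["seed_ortholog"]).reset_index(drop=True)
print(clean)
```

To see what the NaN rows actually contain in the real file, something like data[data["seed_ortholog"].isna()] shows them in full, which may help decide whether they can safely be dropped.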