Hello, I currently have GFF- and GBK-formatted files with functional annotations for several contigs. Some CDSs have 20 or more annotations, because each annotation corresponds to a separate phmmer alignment.
Some annotations are similar but nonetheless non-redundant (e.g. annot6=(hmm search - phmmer) fig|268746.6.peg.87 [Phage protein] [ACLAME_Phage_proteins_with_unknown_functions| Phage_cyanophage| Phage_experimental] [268746.6] [Prochlorococcus phage P-SSM2.];annot7=(hmm search - phmmer) fig|1204516.3.peg.65 [Phage protein] [ACLAME_Phage_proteins_with_unknown_functions| Phage_cyanophage| Phage_experimental] [1204516.3] [Aeromonas phage CC2]).
Other annotations are very different because they were inferred with different tools (e.g. annot9=(hmm search - phmmer) sp|P46671|FAR3_YEAST Factor arrest protein 3 OS;phyre2=Ribonuclease PH domain).
Is there a program somewhere that is able to infer a consensus function for each CDS given the multiple putative functions? Or at least narrow it down? I know this is a tricky ask but I won't be able to manually curate the annotations because there are thousands of different coding sequences. Any advice would be greatly appreciated!
EDIT: Just in case, here's a full example of the monstrosity that is the complete set of annotations for one gene:
VMAG_158.1-circular PhATE CDS 21295 21513 . + . ID=VMAG_158.1-circular_superset_572_geneCall_cds;annot1=(hmm search - phmmer) gi|1109302444|ref|YP_009325570.1| hypothetical protein [Only Syngen Nebraska virus 5];annot10=(hmm search - phmmer) sp|A0RQZ6|KHSE_CAMFF Homoserine kinase OS;annot11=(hmm search - phmmer) sp|Q796P5|YITY_BACSU Uncharacterized FAD-linked oxidoreductase YitY OS;annot12=(hmm search - phmmer) AMV37685.1|GH130| | beta-1,4-mannosylglucose phosphorylase (EC 2.4.1.281)| beta-1,4-mannooligosaccharide phosphorylase (EC 2.4.1.319)| beta-1,4-mannosyl-N-acetyl-glucosamine phosphorylase (EC 2.4.1.320)| beta-1,2-mannobiose phosphorylase (EC 2.4.1.-)| beta-1,2-oligomannan phosphorylase (EC 2.4.1.-)| beta-1,2-mannosidase (EC 3.2.1.-);annot13=(hmm search - phmmer) CBL07590.1|GT2| | cellulose synthase (EC 2.4.1.12)| chitin synthase (EC 2.4.1.16)| dolichyl-phosphate beta-D-mannosyltransferase (EC 2.4.1.83)| dolichyl-phosphate beta-glucosyltransferase (EC 2.4.1.117)| N-acetylglucosaminyltransferase (EC 2.4.1.-)| N-acetylgalactosaminyltransferase (EC 2.4.1.-)| hyaluronan synthase (EC 2.4.1.212)| chitin oligosaccharide synthase (EC 2.4.1.-)| beta-1,3-glucan synthase (EC 2.4.1.34)| beta-1,4-mannan synthase (EC 2.4.1.-)| beta-mannosylphosphodecaprenol-mannooligosaccharide alpha-1,6-mannosyltransferase (EC 2.4.1.199)| UDP-Galf: rhamnopyranosyl-N-acetylglucosaminyl-PP-decaprenol beta-1,4/1,5-galactofuranosyltransferase (EC 2.4.1.287)| UDP-Galf: galactofuranosyl-galactofuranosyl-rhamnosyl-N-acetylglucosaminyl-PP-decaprenol beta-1,5/1,6-galactofuranosyltransferase (EC 2.4.1.288)| dTDP-L-Rha: N-acetylglucosaminyl-PP-decaprenol alpha-1,3-L-rhamnosyltransferase (EC 2.4.1.289);annot14=(hmm search - phmmer) CBL14430.1|GT2| | cellulose synthase (EC 2.4.1.12)| chitin synthase (EC 2.4.1.16)| dolichyl-phosphate beta-D-mannosyltransferase (EC 2.4.1.83)| dolichyl-phosphate beta-glucosyltransferase (EC 2.4.1.117)| N-acetylglucosaminyltransferase (EC 2.4.1.-)| 
N-acetylgalactosaminyltransferase (EC 2.4.1.-)| hyaluronan synthase (EC 2.4.1.212)| chitin oligosaccharide synthase (EC 2.4.1.-)| beta-1,3-glucan synthase (EC 2.4.1.34)| beta-1,4-mannan synthase (EC 2.4.1.-)| beta-mannosylphosphodecaprenol-mannooligosaccharide alpha-1,6-mannosyltransferase (EC 2.4.1.199)| UDP-Galf: rhamnopyranosyl-N-acetylglucosaminyl-PP-decaprenol beta-1,4/1,5-galactofuranosyltransferase (EC 2.4.1.287)| UDP-Galf: galactofuranosyl-galactofuranosyl-rhamnosyl-N-acetylglucosaminyl-PP-decaprenol beta-1,5/1,6-galactofuranosyltransferase (EC 2.4.1.288)| dTDP-L-Rha: N-acetylglucosaminyl-PP-decaprenol alpha-1,3-L-rhamnosyltransferase (EC 2.4.1.289);annot15=(hmm search - phmmer) VCV20559.1|GT2| | cellulose synthase (EC 2.4.1.12)| chitin synthase (EC 2.4.1.16)| dolichyl-phosphate beta-D-mannosyltransferase (EC 2.4.1.83)| dolichyl-phosphate beta-glucosyltransferase (EC 2.4.1.117)| N-acetylglucosaminyltransferase (EC 2.4.1.-)| N-acetylgalactosaminyltransferase (EC 2.4.1.-)| hyaluronan synthase (EC 2.4.1.212)| chitin oligosaccharide synthase (EC 2.4.1.-)| beta-1,3-glucan synthase (EC 2.4.1.34)| beta-1,4-mannan synthase (EC 2.4.1.-)| beta-mannosylphosphodecaprenol-mannooligosaccharide alpha-1,6-mannosyltransferase (EC 2.4.1.199)| UDP-Galf: rhamnopyranosyl-N-acetylglucosaminyl-PP-decaprenol beta-1,4/1,5-galactofuranosyltransferase (EC 2.4.1.287)| UDP-Galf: galactofuranosyl-galactofuranosyl-rhamnosyl-N-acetylglucosaminyl-PP-decaprenol beta-1,5/1,6-galactofuranosyltransferase (EC 2.4.1.288)| dTDP-L-Rha: N-acetylglucosaminyl-PP-decaprenol alpha-1,3-L-rhamnosyltransferase (EC 2.4.1.289);annot16=(hmm search - phmmer) QAT42016.1|GT2| | cellulose synthase (EC 2.4.1.12)| chitin synthase (EC 2.4.1.16)| dolichyl-phosphate beta-D-mannosyltransferase (EC 2.4.1.83)| dolichyl-phosphate beta-glucosyltransferase (EC 2.4.1.117)| N-acetylglucosaminyltransferase (EC 2.4.1.-)| N-acetylgalactosaminyltransferase (EC 2.4.1.-)| hyaluronan synthase (EC 2.4.1.212)| chitin oligosaccharide 
synthase (EC 2.4.1.-)| beta-1,3-glucan synthase (EC 2.4.1.34)| beta-1,4-mannan synthase (EC 2.4.1.-)| beta-mannosylphosphodecaprenol-mannooligosaccharide alpha-1,6-mannosyltransferase (EC 2.4.1.199)| UDP-Galf: rhamnopyranosyl-N-acetylglucosaminyl-PP-decaprenol beta-1,4/1,5-galactofuranosyltransferase (EC 2.4.1.287)| UDP-Galf: galactofuranosyl-galactofuranosyl-rhamnosyl-N-acetylglucosaminyl-PP-decaprenol beta-1,5/1,6-galactofuranosyltransferase (EC 2.4.1.288)| dTDP-L-Rha: N-acetylglucosaminyl-PP-decaprenol alpha-1,3-L-rhamnosyltransferase (EC 2.4.1.289);annot2=(hmm search - phmmer) gi|157953301|ref|YP_001498192.1| hypothetical protein AR158_C110R [Paramecium bursaria Chlorella virus AR158];annot3=(hmm search - phmmer) gi|61805973|ref|YP_214333.1| hypothetical protein PSSM2_101 [Prochlorococcus phage P-SSM2];annot4=(hmm search - phmmer) gi|157952485|ref|YP_001497377.1| hypothetical protein NY2A_B181L [Paramecium bursaria Chlorella virus NY2A];annot5=(hmm search - phmmer) gi|157953362|ref|YP_001498253.1| hypothetical protein AR158_C171L [Paramecium bursaria Chlorella virus AR158];annot6=(hmm search - phmmer) fig|268746.6.peg.87 [Phage protein] [ACLAME_Phage_proteins_with_unknown_functions| Phage_cyanophage| Phage_experimental] [268746.6] [Prochlorococcus phage P-SSM2.];annot7=(hmm search - phmmer) fig|1204516.3.peg.65 [Phage protein] [ACLAME_Phage_proteins_with_unknown_functions| Phage_cyanophage| Phage_experimental] [1204516.3] [Aeromonas phage CC2];annot8=(hmm search - phmmer) sp|P0C860|MS3L2_HUMAN Putative male-specific lethal-3 protein-like 2 OS;annot9=(hmm search - phmmer) sp|P46671|FAR3_YEAST Factor arrest protein 3 OS;phyre2=Ribonuclease PH domain 2-like Ribonuclease PH domain 2-like Ribonuclease PH domain 2-like
Or, rather, is there a way I can bin each annotation into certain categories such as "nucleotide biosynthesis", "carbohydrate metabolism" etc? Maybe that would be easier?
It might be worth taking a look at something like CADD, which combines many annotations of genetic variants into a single score: https://academic.oup.com/nar/article/47/D1/D886/5146191 This article is already pretty old and I'm sure there is more up-to-date thinking on the matter, but it's at least a framework for thinking about how annotations might be integrated.
It's probably also worth making a covariance or correlation matrix for all pairs of annotations: how often do they agree?
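As a rough sketch of that pairwise check, here is one way to count how often two annotation sources agree. It assumes each annotation has already been reduced to one coarse label per source and per CDS; the source names and labels below are made up for illustration and are not taken from the actual files.

```python
from itertools import combinations

# Hypothetical toy data: CDS id -> {annotation source -> coarse label}.
# Real labels would have to be derived from the GFF attributes first.
calls = {
    "cds_1": {"phmmer_refseq": "hypothetical", "phmmer_swissprot": "kinase", "phyre2": "ribonuclease"},
    "cds_2": {"phmmer_refseq": "kinase", "phmmer_swissprot": "kinase", "phyre2": "kinase"},
    "cds_3": {"phmmer_refseq": "hypothetical", "phmmer_swissprot": "hypothetical"},
}

def agreement_matrix(calls):
    """Return {(source_a, source_b): fraction of shared CDSs with identical labels}."""
    sources = sorted({s for labels in calls.values() for s in labels})
    matrix = {}
    for a, b in combinations(sources, 2):
        # Only compare CDSs that both sources actually annotated.
        shared = [labels for labels in calls.values() if a in labels and b in labels]
        if not shared:
            continue  # the two sources never annotate the same CDS
        agree = sum(labels[a] == labels[b] for labels in shared)
        matrix[(a, b)] = agree / len(shared)
    return matrix

matrix = agreement_matrix(calls)
for (a, b), frac in matrix.items():
    print(f"{a} vs {b}: {frac:.2f}")
```

Pairs of sources that almost always agree are largely redundant, while pairs that rarely agree are the ones worth looking at by hand before trusting any consensus.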
In your comment, what you are proposing is a major undertaking at present. The field of ontological database curation is exploding. The answer is: sure, you can bin however you want. The problem is how to do it in a way that is accurate and meaningful enough for you to be able to study other biological phenomena using that as a lens.
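To make the "bin however you want" option concrete, here is a minimal sketch that splits a GFF attribute string like the one in the question into key/value pairs and tallies a category vote per annotation. The keyword-to-category map is entirely hypothetical and only there to show the mechanics; a meaningful version would need curated terms or a proper ontology, which is exactly the hard part. The example string is a shortened copy of the record posted above.

```python
from collections import Counter

# Hypothetical keyword map, for illustration only. Dict order matters:
# the first category with a matching keyword wins.
CATEGORY_KEYWORDS = {
    "carbohydrate metabolism": ["synthase", "phosphorylase", "mannosidase", "transferase"],
    "nucleic acid processing": ["ribonuclease", "nuclease", "polymerase"],
    "amino acid metabolism": ["homoserine kinase"],
    "unknown function": ["phage protein", "hypothetical protein", "uncharacterized"],
}

def parse_attributes(column9):
    """Split a GFF attribute string (column 9) into a {key: value} dict."""
    attrs = {}
    for field in column9.strip().split(";"):
        key, sep, value = field.partition("=")
        if sep:  # skip fragments that are not key=value pairs
            attrs[key.strip()] = value.strip()
    return attrs

def bin_annotation(text):
    """Return the first category whose keyword occurs in the text, else None."""
    lowered = text.lower()
    for category, keywords in CATEGORY_KEYWORDS.items():
        if any(keyword in lowered for keyword in keywords):
            return category
    return None

def category_votes(column9):
    """Count how many annotations of one CDS fall into each category."""
    votes = Counter()
    for key, value in parse_attributes(column9).items():
        if key == "ID":
            continue  # the ID is not a functional annotation
        category = bin_annotation(value)
        if category is not None:
            votes[category] += 1
    return votes

example = (
    "ID=VMAG_158.1-circular_superset_572_geneCall_cds;"
    "annot1=(hmm search - phmmer) gi|1109302444|ref|YP_009325570.1| "
    "hypothetical protein [Only Syngen Nebraska virus 5];"
    "annot10=(hmm search - phmmer) sp|A0RQZ6|KHSE_CAMFF Homoserine kinase OS;"
    "annot9=(hmm search - phmmer) sp|P46671|FAR3_YEAST Factor arrest protein 3 OS;"
    "phyre2=Ribonuclease PH domain 2-like"
)

votes = category_votes(example)
print(votes.most_common())
```

A flat vote like this treats every hit equally. In practice you would probably want to weight each hit by its phmmer E-value or score, and to ignore "unknown function" votes whenever any informative category is present.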