Question

How to get consensus annotations for a de novo orthogroup/ortholog analysis?

1

3.6 years ago

O.rka ▴ 750

I ran OrthoFinder on a dataset and have a bunch of [transcript] -> [orthogroup] identifier mappings. I also have transcript-level annotations, but I'm struggling to get orthogroup-level annotations because the annotations within an orthogroup are usually slightly different from one another.

Are there any (graph-based?) solutions that take in a list of similar proteins and find a consensus annotation?

Basically, there are a lot of instances like this:

NODE_78620_length_358_cov_1.452632_g41825_i0    WP_085681669.1 ATP-dependent DNA helicase RecG [Marinobacter salarius]
NODE_59519_length_393_cov_1.318750_g26533_i0    HBI79933.1 ATP-dependent DNA helicase RecG [Marinobacter adhaerens]
NODE_56681_length_706_cov_2.672986_g26179_i0    WP_209399027.1 ATP-dependent DNA helicase RecG [Marinobacter salsuginis]
NODE_87702_length_342_cov_2.297398_g47662_i0    WP_085681669.1 ATP-dependent DNA helicase RecG [Marinobacter salarius]

Notice that all four point to ATP-dependent DNA helicase RecG, but with different species and accession IDs. This is the simplest case. Many others are riddled with edge cases, which is why I'm asking whether anyone already has a method for this, so I don't reinvent the wheel.

metatranscriptomics metagenomics transcriptomics • 1.7k views

ADD COMMENT • link updated 3.6 years ago by Dunois ★ 2.9k • written 3.6 years ago by O.rka ▴ 750

0

I think the most you can do is take all those annotations, deduplicate them, concatenate them, and use the resulting string as the "consensus" annotation. For example, you would end up with something like Kinase A, Kinase B, Kinase C if your orthogroup contained several members carrying each of those annotations.
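A minimal sketch of that deduplicate-and-concatenate idea, assuming the annotations have already been reduced to plain product names. The function name `concatenate_annotations` and the separator are illustrative choices, not something OrthoFinder provides.

```python
def concatenate_annotations(annotations, sep=", "):
    """Deduplicate annotations (keeping first-seen order) and join them."""
    cleaned = (a.strip() for a in annotations)
    # dict.fromkeys preserves insertion order, so the result is deterministic.
    unique = dict.fromkeys(a for a in cleaned if a)
    return sep.join(unique)


# Hypothetical orthogroup with repeated annotations.
orthogroup = ["Kinase A", "Kinase B", "Kinase A", "Kinase C", "Kinase B"]
consensus = concatenate_annotations(orthogroup)
print(consensus)
```

Keeping first-seen order rather than sorting means the label is stable as long as the input order is stable; sorting would be an equally reasonable choice.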

In general, I think the notion of "consensus" annotations for OrthoFinder results is a bit problematic. The orthogroups the tool reports depend on the set of species the user supplies as input, and on the node of the tree at which the orthogroup under consideration lies (for example, if the user has chosen to look at one of the many hierarchical orthogroups, or HOGs).

I am also curious to hear a little bit more about your edge cases.

In the meantime, I urge you to take a look at the discussions here:

GitHub Issue #362

GitHub Issue #451

GitHub Issue #373

These may be helpful.

ADD REPLY • link 3.6 years ago by Dunois ★ 2.9k

1

I appreciate this! I understand it could be a bit problematic and maybe a concatenated dereplicated annotation is the most appropriate. Here are a few types of edge cases:

Orthogroup_1:

WP_136629512.1 copper resistance system multicopper oxidase [Marinobacter salsuginis]
WP_088827080.1 copper resistance system multicopper oxidase [Marinobacter sp. es.048]
MBG13704.1 copper oxidase [Alcanivorax sp.]
WP_136629512.1 copper resistance system multicopper oxidase [Marinobacter salsuginis]

Maybe I should note that I'm using the best hit to NR as my main annotation source. I have KOFAM hits, but these are much sparser.

Orthogroup_2:

WP_127554864.1 hypothetical protein [Saccharospirillum alexandrii]
MBL84502.1 DNA repair protein [Marinobacter sp.]
WP_088557700.1 DNA repair protein [Marinobacter sp. es.042]
WP_127554864.1 hypothetical protein [Saccharospirillum alexandrii]
WP_127554864.1 hypothetical protein [Saccharospirillum alexandrii]
MBL84502.1 DNA repair protein [Marinobacter sp.]
WP_088557700.1 DNA repair protein [Marinobacter sp. es.042]

Orthogroup_3:

WP_127556764.1 methylated-DNA--[protein]-cysteine S-methyltransferase [Saccharospirillum alexandrii]
PWL24267.1 cysteine methyltransferase [Fluviicola sp. XM-24bin1]
WP_027242382.1 trifunctional transcriptional activator/DNA repair protein Ada/methylated-DNA--[protein]-cysteine S-methyltransferase [Pseudophaeobacter arcticus]
WP_127556764.1 methylated-DNA--[protein]-cysteine S-methyltransferase [Saccharospirillum alexandrii]
PWL24267.1 cysteine methyltransferase [Fluviicola sp. XM-24bin1]
WP_085628306.1 MULTISPECIES: methylated-DNA--[protein]-cysteine S-methyltransferase [Marivita]
WP_027242382.1 trifunctional transcriptional activator/DNA repair protein Ada/methylated-DNA--[protein]-cysteine S-methyltransferase [Pseudophaeobacter arcticus]

Orthogroup_4:

WP_085632770.1 MULTISPECIES: acryloyl-CoA reductase [Marivita]
WP_085632770.1 MULTISPECIES: acryloyl-CoA reductase [Marivita]
WP_085632770.1 MULTISPECIES: acryloyl-CoA reductase [Marivita]
WP_085632770.1 MULTISPECIES: acryloyl-CoA reductase [Marivita]
MBO6885897.1 oxidoreductase [Marivita sp.]
MBO6676874.1 oxidoreductase [Hyphomicrobiales bacterium]
MBO6885897.1 oxidoreductase [Marivita sp.]

Orthogroup_5:

WP_069183863.1 MULTISPECIES: AraC family transcriptional regulator [Marinobacter]
MTJ00553.1 AraC family transcriptional regulator [Marinobacter adhaerens]
WP_007151860.1 AraC family transcriptional regulator [Marinobacter algicola]
WP_069183863.1 MULTISPECIES: AraC family transcriptional regulator [Marinobacter]
WP_069183863.1 MULTISPECIES: AraC family transcriptional regulator [Marinobacter]
HAU18574.1 AraC family transcriptional regulator [Marinobacter adhaerens]
ADP96820.1 transcriptional regulator, AraC family protein [Marinobacter adhaerens HP15]

Orthogroup_6:

MBO6730368.1 YeeE/YedE family protein [Maricaulis sp.]
WP_153633487.1 YeeE/YedE family protein [Marinobacter salsuginis]
PCJ11299.1 hypothetical protein COA98_07855 [Candidatus Marinimicrobia bacterium]
WP_085631163.1 MULTISPECIES: hypothetical protein [Marivita]
WP_085631163.1 MULTISPECIES: hypothetical protein [Marivita]
WP_153633487.1 YeeE/YedE family protein [Marinobacter salsuginis]
WP_085631163.1 MULTISPECIES: hypothetical protein [Marivita]

ADD REPLY • link 3.6 years ago by O.rka ▴ 750

0

Thanks for sharing this!! All of these orthogroups actually look fine? Could you elaborate on why you consider these edge cases?

ADD REPLY • link 3.6 years ago by Dunois ★ 2.9k

1

Edge cases in the sense of writing a script that can parse them and come up with a single annotation without hardcoding everything.

ADD REPLY • link 3.6 years ago by O.rka ▴ 750

0

I don't think these are really edge cases. You could always remove the common stop words (e.g., MULTISPECIES:) and then just concatenate what remains. I don't think that'd turn out to be too bad.
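As a rough sketch of that stop-word approach, the snippet below strips the leading accession, the MULTISPECIES: prefix, and the trailing [species] tag from NR-style hit lines before concatenating. The regular expressions and the names `product_name` and `concatenated_annotation` are assumptions based on the examples posted in this thread, not a general parser for NR deflines.

```python
import re

# Assumed layout: "<accession.version> [MULTISPECIES: ]<product> [<taxon>]"
ACCESSION = re.compile(r"^[A-Z]+_?\d+\.\d+\s+")
STOP_PREFIX = re.compile(r"^MULTISPECIES:\s*")
# Only the final bracketed group is removed, so names such as
# "methylated-DNA--[protein]-cysteine S-methyltransferase" stay intact.
TRAILING_TAXON = re.compile(r"\s*\[[^\[\]]*\]\s*$")


def product_name(hit):
    """Reduce one hit line to its bare product name."""
    name = ACCESSION.sub("", hit.strip())
    name = STOP_PREFIX.sub("", name)
    return TRAILING_TAXON.sub("", name)


def concatenated_annotation(hits, sep="; "):
    """Join the distinct product names of an orthogroup, in first-seen order."""
    names = (product_name(h) for h in hits)
    return sep.join(dict.fromkeys(n for n in names if n))


orthogroup_4 = [
    "WP_085632770.1 MULTISPECIES: acryloyl-CoA reductase [Marivita]",
    "WP_085632770.1 MULTISPECIES: acryloyl-CoA reductase [Marivita]",
    "MBO6885897.1 oxidoreductase [Marivita sp.]",
    "MBO6676874.1 oxidoreductase [Hyphomicrobiales bacterium]",
]
label = concatenated_annotation(orthogroup_4)
print(label)
```

Variants that differ only in word order, such as "transcriptional regulator, AraC family protein" in Orthogroup_5, would still come out as separate entries with this sketch.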

I suppose you could also just go with a majority rule, and just take the most frequent annotation as the "representative" annotation for these edge case orthogroups. (That'd also automatically work out for non-edge cases.)
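The majority rule could be sketched as follows, assuming each orthogroup is already a list of cleaned product names. The function name `majority_annotation` and the optional `uninformative` parameter are my own illustrative additions: by default it is a plain majority vote, and the parameter only exists to show what happens if labels like "hypothetical protein" are set aside.

```python
from collections import Counter


def majority_annotation(names, uninformative=()):
    """Return (label, support) for the most frequent product name.

    Names listed in `uninformative` are ignored unless nothing else remains.
    Support is the fraction of the names actually voted on.
    """
    informative = [n for n in names if n not in uninformative]
    pool = informative or list(names)
    if not pool:
        return None, 0.0
    # most_common orders equal counts by first appearance, which breaks ties.
    label, count = Counter(pool).most_common(1)[0]
    return label, count / len(pool)


# Product names from Orthogroup_6 above, with accessions and species removed.
orthogroup_6 = [
    "YeeE/YedE family protein",
    "YeeE/YedE family protein",
    "hypothetical protein",
    "hypothetical protein",
    "hypothetical protein",
    "YeeE/YedE family protein",
    "hypothetical protein",
]
plain = majority_annotation(orthogroup_6)
filtered = majority_annotation(orthogroup_6, uninformative=("hypothetical protein",))
print(plain[0], "|", filtered[0])
```

Reporting the support fraction alongside the label makes it easy to flag orthogroups where the "representative" annotation only barely wins.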

ADD REPLY • link 3.6 years ago by Dunois ★ 2.9k