I ran OrthoFinder on a dataset and have a bunch of [transcript] -> [orthogroup] identifier mappings. I also have transcript-level annotations but I'm struggling to get orthogroup-level annotations because most of the annotations are slightly different.
Are there any (graph-based?) solutions that take in a list of similar proteins and then find a consensus annotation?
Basically, there are a lot of instances of this:
NODE_78620_length_358_cov_1.452632_g41825_i0 WP_085681669.1 ATP-dependent DNA helicase RecG [Marinobacter salarius]
NODE_59519_length_393_cov_1.318750_g26533_i0 HBI79933.1 ATP-dependent DNA helicase RecG [Marinobacter adhaerens]
NODE_56681_length_706_cov_2.672986_g26179_i0 WP_209399027.1 ATP-dependent DNA helicase RecG [Marinobacter salsuginis]
NODE_87702_length_342_cov_2.297398_g47662_i0 WP_085681669.1 ATP-dependent DNA helicase RecG [Marinobacter salarius]
See how all of them point to ATP-dependent DNA helicase RecG, but with different species and accession IDs? This is the simplest case. Many others are riddled with edge cases, which is why I'm checking whether anyone already has a method for this, so I don't try to reinvent the wheel.
I think the most you can do is just take all those annotations, deduplicate them, concatenate them, and use that string as the "consensus" annotation. For example, something like
Kinase A, Kinase B, Kinase C
if you had a bunch of each of these in your orthogroup.
In general, I think the notion of "consensus" annotations for OrthoFinder results is a bit problematic. The orthogroups the tool reports will depend on the set of species the user supplies as input, and on the node of the tree at which the orthogroup under consideration lies (if the user has chosen to look at one of the many HOGs, for example).
I am also curious to hear a little bit more about your edge cases.
In the meantime, I urge you to take a look at the discussions here:
GitHub Issue #362
GitHub Issue #451
GitHub Issue #373
These may be helpful.
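To make the deduplicate-and-concatenate idea concrete, here is a minimal sketch in Python. It assumes each best-hit string looks like `ACCESSION description [species]`, as in the RecG lines from the question; the function names and the orthogroup ID `OG0000001` are made up for illustration and are not part of OrthoFinder.

```python
import re
from collections import defaultdict


def clean_hit(hit):
    """Reduce 'ACCESSION description [species]' to just the description."""
    # Assumption: the first whitespace-separated token is the accession.
    _accession, _, rest = hit.strip().partition(" ")
    # Drop a trailing "[Genus species]" tag if one is present.
    return re.sub(r"\s*\[[^\]]*\]\s*$", "", rest).strip()


def concatenate_by_orthogroup(transcript_to_orthogroup, transcript_to_hit, sep=", "):
    """Return {orthogroup: deduplicated, concatenated descriptions}."""
    grouped = defaultdict(dict)  # inner dict acts as an insertion-ordered set
    for transcript, orthogroup in transcript_to_orthogroup.items():
        hit = transcript_to_hit.get(transcript)
        if hit is None:
            continue  # transcript has no annotation
        description = clean_hit(hit)
        if description:
            grouped[orthogroup][description] = None
    return {og: sep.join(descs) for og, descs in grouped.items()}


# Transcripts and hits from the question; the orthogroup ID is hypothetical.
transcript_to_hit = {
    "NODE_78620_length_358_cov_1.452632_g41825_i0": "WP_085681669.1 ATP-dependent DNA helicase RecG [Marinobacter salarius]",
    "NODE_59519_length_393_cov_1.318750_g26533_i0": "HBI79933.1 ATP-dependent DNA helicase RecG [Marinobacter adhaerens]",
    "NODE_56681_length_706_cov_2.672986_g26179_i0": "WP_209399027.1 ATP-dependent DNA helicase RecG [Marinobacter salsuginis]",
    "NODE_87702_length_342_cov_2.297398_g47662_i0": "WP_085681669.1 ATP-dependent DNA helicase RecG [Marinobacter salarius]",
}
transcript_to_orthogroup = {t: "OG0000001" for t in transcript_to_hit}

result = concatenate_by_orthogroup(transcript_to_orthogroup, transcript_to_hit)
print(result)
```

With the four RecG hits, the accession and species differences drop out and the orthogroup ends up with a single description. An orthogroup with genuinely different descriptions would get them joined in order of first appearance.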
I appreciate this! I understand it could be a bit problematic and maybe a concatenated dereplicated annotation is the most appropriate. Here are a few types of edge cases:
Orthogroup_1:
Maybe I should note that I'm using the best hit to NR as my main annotation source. I have KOFAM hits, but these are much sparser.
Orthogroup_2:
Orthogroup_3:
Orthogroup_4:
Orthogroup_5:
Orthogroup_6:
Thanks for sharing this!! All of these orthogroups actually look fine? Could you elaborate on why you consider these edge cases?
I mean edge cases in terms of writing a script that can parse these and come up with a single annotation without hardcoding everything.
I don't think these are really edge cases. You could always remove most stop words (e.g., MULTISPECIES:), and then just concatenate the rest. I don't think that'd turn out to be too bad.
I suppose you could also just go with a majority rule, and just take the most frequent annotation as the "representative" annotation for these edge case orthogroups. (That'd also automatically work out for non-edge cases.)
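A rough sketch of the stop-word plus majority-rule approach, in Python. It assumes the accession and species tag have already been stripped from each description; `STOP_PREFIXES`, `normalize`, and `majority_annotation` are hypothetical names, and the example descriptions are invented for illustration.

```python
from collections import Counter

# Assumption: only prefix-style tags need removing; extend this tuple as needed.
STOP_PREFIXES = ("MULTISPECIES:",)


def normalize(description):
    """Strip known stop-word prefixes and surrounding whitespace."""
    text = description.strip()
    for prefix in STOP_PREFIXES:
        if text.startswith(prefix):
            text = text[len(prefix):].strip()
    return text


def majority_annotation(descriptions):
    """Return (most frequent description, fraction of hits supporting it)."""
    counts = Counter(d for d in map(normalize, descriptions) if d)
    if not counts:
        return None, 0.0
    # Ties are resolved in favour of the description seen first.
    label, n = counts.most_common(1)[0]
    return label, n / sum(counts.values())


# Invented descriptions for one orthogroup.
descriptions = [
    "MULTISPECIES: ATP-dependent DNA helicase RecG",
    "ATP-dependent DNA helicase RecG",
    "hypothetical protein",
]
label, support = majority_annotation(descriptions)
print(label, round(support, 2))
```

Returning the support fraction alongside the label makes it easy to flag orthogroups where the "representative" annotation only covers a small share of the members, so those can be checked by hand or fall back to the concatenated string.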