You may have to look directly into the script to find out how this is done, as I don't recall ever reading details of this procedure. I would suspect that if prokka uses percent identity or coverage, it is done in a loose fashion.
While your concern is legitimate, I'd like to think that Torsten knows what he's doing when it comes to annotations. Besides, like with any automated annotation procedure, I don't think the goal is to identify only narrow family members or exact orthologs. The subsequent annotation by HMMs would never allow it, because neither Pfam nor TIGR HMMs are sufficiently tuned to identify only family members. Many of their HMMs identify superfamily members despite the professed goals and Pfam name (Protein FAMilies). Some time ago I contributed an HMM to Pfam that explicitly identifies superfamily members, and they are still using it as such. The annotation you'd get from a match to that HMM is "Endonuclease/Exonuclease/phosphatase family", even though those are 3 distinct functionalities and can't possibly be members of the same protein family. General functionality aside, there are at least 4 different groups of substrates for those metalloenzymes, and probably a dozen subgroups if you break it down by the exact substrate.
If percent identity or coverage were to be used strictly for annotations (and not just with prokka), many incomplete proteins from metagenomes would fail the coverage criterion, and many viruses, thermophiles or other organisms with exotic protein composition would fail the percent identity. That would leave many proteins unannotated, even though they are realistically of known function.
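To make that concrete, here is a minimal Python sketch of what a strict versus a loose filter does to three made-up hits. Everything in it is an assumption for illustration: the `Hit` class, the hit names, the numbers and the cutoffs are invented, and this is not how prokka itself filters hits.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """One hypothetical database hit for a query protein."""
    query: str
    pct_identity: float  # percent identical residues in the aligned region
    aln_len: int         # number of reference residues covered by the alignment
    ref_len: int         # full length of the reference protein

    @property
    def coverage(self) -> float:
        """Percent of the reference protein covered by the alignment."""
        return 100.0 * self.aln_len / self.ref_len


def annotated(hits, min_identity, min_coverage):
    """Return the queries whose hit passes both cutoffs."""
    return [
        h.query
        for h in hits
        if h.pct_identity >= min_identity and h.coverage >= min_coverage
    ]


# Invented example values, not real search results.
hits = [
    Hit("full_length_mesophile", 85.0, 295, 300),
    Hit("metagenome_fragment", 90.0, 120, 300),  # truncated ORF: 40% coverage
    Hit("thermophile_homolog", 32.0, 290, 300),  # divergent: low identity
]

# Hypothetical cutoffs chosen only to contrast the two regimes.
strict = annotated(hits, min_identity=50.0, min_coverage=80.0)
loose = annotated(hits, min_identity=25.0, min_coverage=30.0)

print("strict:", strict)
print("loose: ", loose)
```

With the strict cutoffs only the full-length hit keeps its annotation; the fragment fails on coverage and the divergent homolog fails on identity, even though both would plausibly deserve at least a general functional label.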
It may help to think about it this way: if a protein is a lipid methyltransferase but is instead annotated as a nucleotide methyltransferase, is that really a very wrong annotation? Do we prefer annotations that are generally correct even if unreliable in fine details, or would it be better for that protein to remain unannotated? Ideally we'd like the annotations to be correct in general and in details, but that is not realistic at this point as so many protein (super)families have not been experimentally characterized.
Thank you for the thorough response! This has helped clear up my understanding of the goal of annotation.