I am embarking on gene prediction and annotation of 100+ fungal draft genomes - all from related strains of the same species.
At the end, I want to ensure that gene IDs across these annotations reflect their mutual relationships. For example, the IDs for gene 1 in strains 1, 2, ..., N would be G1_Str1, G1_Str2, ..., G1_StrN, and so on up to gene Z, whose IDs would be GZ_Str1, GZ_Str2, ..., GZ_StrN.
I expect there to be a lot of gene presence / absence variation. So Z (see above) would be the number of genes in the pan genome across these 100+ strains.
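To make the naming scheme concrete, here is a minimal sketch of what I have in mind once the groups of counterpart genes are known. Everything in it is hypothetical: the function name `assign_pan_ids`, the original gene IDs, and the assumption that each group holds at most one gene per strain.

```python
from typing import Dict, List, Tuple


def assign_pan_ids(
    clusters: List[Dict[str, str]], strains: List[str]
) -> Tuple[Dict[Tuple[str, str], str], Dict[str, Dict[str, bool]]]:
    """Name each cluster G1..GZ and each member G<k>_<strain>.

    clusters: one dict per gene group, mapping strain -> original gene ID
              (assumes at most one member per strain, i.e. no paralogs).
    """
    new_ids = {}   # (strain, original gene ID) -> shared-style gene ID
    presence = {}  # "G<k>" -> {strain: present?}
    for k, members in enumerate(clusters, start=1):
        group = f"G{k}"
        presence[group] = {s: s in members for s in strains}
        for strain, original_id in members.items():
            new_ids[(strain, original_id)] = f"{group}_{strain}"
    return new_ids, presence


# Made-up example: one core gene and one gene missing from Str2.
clusters = [
    {"Str1": "s1_0001", "Str2": "s2_0007", "Str3": "s3_0003"},
    {"Str1": "s1_0002", "Str3": "s3_0010"},
]
new_ids, presence = assign_pan_ids(clusters, ["Str1", "Str2", "Str3"])
```

Here Z is simply `len(presence)`, and the `presence` table doubles as the presence / absence matrix.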
Gene ID correspondence could of course be inferred after prediction and annotation, via inference of orthology and paralogy using OrthoMCL or similar software, and/or via synteny relationships. But I have never performed such analyses on 100+ genomes, only pairwise comparisons, so I'm not sure how, or whether, they scale up (~17K genes and ~50 Mb genome size per strain). Any advice?
Are there any other tricks, e.g. using gene/protein length conservation and % identity as a metric, to ensure that genes deemed counterparts across these fungal strains get identical gene IDs AFTER gene prediction but BEFORE gene annotation?
If there is a way that is not computationally very intensive but is still scientifically acceptable, where the entire set of gene IDs does not have to be renamed for each strain after annotation, I am all eyes and ears. If you have strong opinions on why it should instead be done in the traditional, sequential order, I'd like to understand them as well. Thanks!
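As an illustration of the kind of trick I mean, here is a rough sketch of reciprocal best hits between two strains, filtered on % identity and protein length ratio. It assumes the hits were already parsed into (query, subject, % identity, bit score) tuples, for example from BLAST tabular output, and that protein lengths are available separately. The function names and the thresholds (90% identity, 0.9 length ratio) are placeholders, not recommendations.

```python
def best_hits(rows, lengths, min_identity=90.0, min_length_ratio=0.9):
    """Best-scoring subject per query among hits passing both filters.

    rows:    iterable of (query, subject, percent_identity, bitscore)
    lengths: dict of protein ID -> length in residues
    """
    best = {}
    for query, subject, pident, bitscore in rows:
        shorter = min(lengths[query], lengths[subject])
        longer = max(lengths[query], lengths[subject])
        if pident < min_identity or shorter / longer < min_length_ratio:
            continue  # too divergent, or lengths too different
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return {q: s for q, (s, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a, lengths, **thresholds):
    """Keep pairs (a, b) where b is a's best hit and a is b's best hit."""
    forward = best_hits(a_vs_b, lengths, **thresholds)
    reverse = best_hits(b_vs_a, lengths, **thresholds)
    return {a: b for a, b in forward.items() if reverse.get(b) == a}


# Made-up hits between strain 1 (s1_*) and strain 2 (s2_*).
lengths = {"s1_g1": 300, "s2_g1": 302, "s1_g2": 210,
           "s2_g2": 205, "s1_g3": 400, "s2_g3": 150}
a_vs_b = [("s1_g1", "s2_g1", 98.5, 600.0), ("s1_g1", "s2_g2", 91.0, 300.0),
          ("s1_g2", "s2_g2", 95.0, 400.0), ("s1_g3", "s2_g3", 99.0, 200.0)]
b_vs_a = [("s2_g1", "s1_g1", 98.5, 600.0), ("s2_g2", "s1_g2", 95.0, 400.0),
          ("s2_g3", "s1_g3", 99.0, 200.0)]
pairs = reciprocal_best_hits(a_vs_b, b_vs_a, lengths)
```

In this toy input the s1_g3 / s2_g3 pair is dropped despite 99% identity, because the two proteins differ too much in length.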
Thanks for your reply. In response:
So I wonder whether reliance on synteny should be restricted to the core genome... Your thoughts?
I would use it on the core genome, then, if it is suitable there. The idea would be to reduce the number of genes you have to deal with. Another idea would be to infer orthology with respect to a closely related, well-annotated species. If you treat orthology as transitive (a reasonable approximation for one-to-one orthologs, though it can break down where lineage-specific duplications have occurred), this would allow you to proceed pairwise, i.e. each strain vs the reference species. Also, 100+ genomes shouldn't be too much of a problem for inferring phylogenetic trees, provided you have suitable compute resources. For example, look at the TreeFam pipeline. I would advise against inferring orthology using a method that doesn't build a phylogenetic tree if you expect many duplications.
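A rough sketch of the reference-based idea, assuming you already have, for each strain, a mapping from strain gene to its inferred counterpart in the reference species. The names `group_by_reference` and `ambiguous_groups`, and all the gene IDs, are made up for illustration. Genes with no reference counterpart never enter any group, so accessory genes absent from the reference would still need separate handling.

```python
from collections import defaultdict


def group_by_reference(strain_to_ref):
    """Collect strain genes under the reference gene they map to.

    strain_to_ref: {strain: {strain gene ID: reference gene ID}}
    returns:       {reference gene ID: {strain: [strain gene IDs]}}
    """
    groups = defaultdict(dict)
    for strain, mapping in strain_to_ref.items():
        for gene, ref_gene in mapping.items():
            groups[ref_gene].setdefault(strain, []).append(gene)
    return dict(groups)


def ambiguous_groups(groups):
    """Reference genes hit by more than one gene in some strain.

    These are candidate duplications, where a tree-based method is safer.
    """
    return sorted(ref for ref, by_strain in groups.items()
                  if any(len(genes) > 1 for genes in by_strain.values()))


# Made-up mappings for two strains against a reference.
strain_to_ref = {
    "Str1": {"s1_0001": "REF_A", "s1_0002": "REF_B"},
    "Str2": {"s2_0007": "REF_A", "s2_0008": "REF_A"},
}
groups = group_by_reference(strain_to_ref)
flagged = ambiguous_groups(groups)

# Number of genome-vs-genome comparisons for 100 strains.
n_strains = 100
vs_reference = n_strains                        # each strain vs the reference
all_vs_all = n_strains * (n_strains - 1) // 2   # every unordered strain pair
```

For 100 strains that is 100 comparisons against the reference versus 4950 for all strain pairs, which is where the saving comes from.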