Question

Maintaining gene ID correspondence across strains


7.5 years ago

Anand Rao ▴ 640

I am embarking on gene prediction and annotation of 100+ fungal draft genomes - all from related strains of the same species.

At the end, I want to ensure that gene IDs across these annotations reflect their mutual relationships. For example, gene 1 in strains 1, 2, ..., N would get the IDs G1_Str1, G1_Str2, ..., G1_StrN, and so on up to gene Z, with the IDs GZ_Str1, GZ_Str2, ..., GZ_StrN.

I expect there to be a lot of gene presence/absence variation, so Z (see above) would be the number of genes in the pan-genome across these 100+ strains.
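To make the naming scheme concrete, here is a minimal Python sketch. It assumes gene families (orthogroups) have already been inferred by some method; the function name `assign_pan_ids`, the strain names and the local gene names are all hypothetical placeholders, not output from any particular tool.

```python
def assign_pan_ids(orthogroups, strains):
    """Map (strain, local gene name) -> pan-genome ID such as 'G2_Str3'.

    orthogroups: list of dicts, each mapping strain name -> local gene name.
                 A strain missing from a dict means the gene is absent there.
    strains:     ordered list of strain names; position gives the Str number.
    """
    strain_index = {name: i + 1 for i, name in enumerate(strains)}
    ids = {}
    for group_number, members in enumerate(orthogroups, start=1):
        for strain, gene in members.items():
            ids[(strain, gene)] = f"G{group_number}_Str{strain_index[strain]}"
    return ids


# Hypothetical toy input: three strains, three gene families (so Z = 3).
strains = ["strainA", "strainB", "strainC"]
orthogroups = [
    {"strainA": "a_0001", "strainB": "b_0007", "strainC": "c_0003"},
    {"strainA": "a_0002", "strainC": "c_0010"},  # absent in strainB
    {"strainB": "b_0042"},                       # strain-specific gene
]
pan_ids = assign_pan_ids(orthogroups, strains)
```

In this sketch Z is simply `len(orthogroups)`, and a gene that is absent from a strain never receives an ID for that strain.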

Gene ID correspondence could of course be inferred after prediction and annotation, by inferring orthology and paralogy with OrthoMCL or similar software, and/or from synteny relationships. But I have only ever performed such analyses as pairwise comparisons, never on 100+ genomes, so I'm not sure how well they scale up, if at all (~17K genes and ~50 Mb genome size per strain). Any advice?

Are there any other tricks, e.g. using gene/protein length conservation and % identity as metrics, to ensure that genes deemed counterparts across these fungal strains get identical gene IDs AFTER gene prediction but BEFORE gene annotation?
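As an illustration of the length and % identity idea, the sketch below keeps only reciprocal best hits between two strains that also pass an identity cutoff and a length-ratio cutoff. The hit tuples, the thresholds (95% identity, 0.9 length ratio) and the function names are assumptions chosen for the example; a filter like this does not by itself distinguish orthologs from recent paralogs.

```python
def best_hits(hits):
    """Return {query: (subject, pct_identity, bitscore)} keeping the top-scoring hit.

    hits: iterable of (query, subject, pct_identity, bitscore) tuples.
    """
    best = {}
    for query, subject, identity, score in hits:
        if query not in best or score > best[query][2]:
            best[query] = (subject, identity, score)
    return best


def reciprocal_pairs(hits_ab, hits_ba, lengths,
                     min_identity=95.0, min_len_ratio=0.9):
    """Reciprocal best hits that also pass identity and length-ratio cutoffs."""
    forward, reverse = best_hits(hits_ab), best_hits(hits_ba)
    pairs = []
    for a, (b, identity, _score) in forward.items():
        if b in reverse and reverse[b][0] == a:
            ratio = min(lengths[a], lengths[b]) / max(lengths[a], lengths[b])
            if identity >= min_identity and ratio >= min_len_ratio:
                pairs.append((a, b))
    return sorted(pairs)


# Hypothetical toy data: protein lengths and hits in both directions.
lengths = {"a1": 400, "b1": 398, "a2": 300, "b2": 150, "a3": 250, "b3": 250}
hits_ab = [("a1", "b1", 99.0, 800), ("a1", "b2", 80.0, 300),
           ("a2", "b2", 97.0, 500), ("a3", "b3", 98.0, 400)]
hits_ba = [("b1", "a1", 99.0, 800), ("b2", "a2", 97.0, 500),
           ("b3", "a3", 98.0, 400)]
pairs = reciprocal_pairs(hits_ab, hits_ba, lengths)
```

Here `a2` and `b2` are reciprocal best hits but are dropped because one protein is only half the length of the other, which might indicate a fragmented or mispredicted gene model in a draft assembly.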

If there is indeed a way that is not computationally very intensive but is still scientifically acceptable, where the entire set of gene IDs does not have to be renamed for each strain post-annotation, I am all eyes and ears. But if you have strong opinions on why it should not be done that way, and should instead follow the traditional, sequential order, I'd like to understand them as well. Thanks!

Gene prediction Gene annotation Gene ID • 1.6k views

updated 7.5 years ago by Jean-Karim Heriche 27k • written 7.5 years ago by Anand Rao ▴ 640

score 0 · Answer 1 · 2017-06-10


7.5 years ago

Jean-Karim Heriche 27k

If there's not too much structural variation, you could probably rely on synteny.



Thanks for your reply. In response:

I do expect quite a bit of presence/absence variation (PAV) in the accessory genome.
PAV is known to be especially high for accessory-genome genes of biological significance for host-pathogen interactions: effector genes and those involved in secondary metabolism (which control pathogenicity, host range and virulence).
To complicate matters further, the number of accessory chromosomes that carry some of these genes varies across strains in this species (I cannot measure that in my strains, since the drafts are nowhere near complete).

So I wonder if reliance on synteny should be restricted to the core genome. Your thoughts?
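For reference, one simple way to separate core from accessory gene families, given a presence/absence table, is sketched below. The `core_fraction` parameter, the family names and the strain names are hypothetical; with incomplete draft assemblies a cutoff below 1.0 may be more realistic, since a gene can be missing from an assembly without being missing from the genome.

```python
def split_core_accessory(presence, n_strains, core_fraction=1.0):
    """Split gene families into core and accessory lists.

    presence:      dict mapping family name -> set of strains that carry it.
    n_strains:     total number of strains compared.
    core_fraction: minimum fraction of strains required to call a family core.
    """
    core, accessory = [], []
    for family, carriers in presence.items():
        if len(carriers) / n_strains >= core_fraction:
            core.append(family)
        else:
            accessory.append(family)
    return sorted(core), sorted(accessory)


# Hypothetical toy presence/absence data for four strains.
presence = {
    "fam1": {"Str1", "Str2", "Str3", "Str4"},
    "fam2": {"Str1", "Str2", "Str3"},
    "fam3": {"Str2"},
}
core, accessory = split_core_accessory(presence, n_strains=4)
soft_core, _ = split_core_accessory(presence, n_strains=4, core_fraction=0.75)
```

Synteny-based ID transfer could then be limited to the families in `core`, leaving the accessory families for a tree-based or similarity-based approach.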

7.5 years ago by Anand Rao ▴ 640


I would use it on the core genome then, if it is suitable. The idea would be to reduce the number of genes you have to deal with. Another idea would be to infer orthology with respect to a closely related, well-annotated species. Because orthology is effectively transitive when there are no lineage-specific duplications, this would allow you to proceed pairwise, i.e. each strain vs. the reference species. Also, 100+ genomes shouldn't be too much of a problem for inferring phylogenetic trees, provided you have suitable compute resources. For example, look at the TreeFam pipeline. I would advise against inferring orthology with a method that doesn't build a phylogenetic tree if you expect many duplications.
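The reference-anchored approach could be sketched roughly as follows. It assumes each strain has already been compared to the reference, giving a mapping from strain gene to reference gene; the function name `group_by_reference` and all gene names are hypothetical. Families where any strain has more than one gene mapped to the same reference gene are set aside as ambiguous, since those are the cases where duplications make a tree-based method preferable.

```python
from collections import defaultdict


def group_by_reference(strain_to_ref):
    """Group strain genes by the reference gene they were matched to.

    strain_to_ref: {strain: {strain_gene: reference_gene}}
    Returns (one_to_one, ambiguous):
      one_to_one: {reference_gene: {strain: strain_gene}}
      ambiguous:  {reference_gene: {strain: [strain_genes]}} where at least
                  one strain has several genes mapped to the reference gene.
    """
    groups = defaultdict(lambda: defaultdict(list))
    for strain, mapping in strain_to_ref.items():
        for gene, ref_gene in mapping.items():
            groups[ref_gene][strain].append(gene)

    one_to_one, ambiguous = {}, {}
    for ref_gene, per_strain in groups.items():
        if all(len(genes) == 1 for genes in per_strain.values()):
            one_to_one[ref_gene] = {s: g[0] for s, g in per_strain.items()}
        else:
            ambiguous[ref_gene] = {s: sorted(g) for s, g in per_strain.items()}
    return one_to_one, ambiguous


# Hypothetical toy mappings for two strains against a reference annotation.
strain_to_ref = {
    "Str1": {"s1_g1": "REF_A", "s1_g2": "REF_B", "s1_g3": "REF_B"},
    "Str2": {"s2_g1": "REF_A", "s2_g9": "REF_B"},
}
one_to_one, ambiguous = group_by_reference(strain_to_ref)
```

Genes with no counterpart in the reference never enter either dictionary, so strain-specific accessory genes would still need a separate strain-vs-strain comparison.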

7.5 years ago by Jean-Karim Heriche 27k