Hello,
I am studying the topic of creating pan genomes. Specifically, I am working on eukaryotes, but I think my question also applies to prokaryotes.
After assembling and annotating all genomes, one usually needs to cluster all predicted gene/protein sequences in order to compare genomes in terms of gene content. In other words, we need to match orthologs from different genomes to each other. As far as I understand, this is usually done with some clustering method, e.g. OrthoMCL, CD-HIT or GET_HOMOLOGUES-EST.
When a cluster only contains 1 (or 0) genes per strain/sample, things are pretty straightforward. However, I couldn't find an explanation of resolving situations where multiple genes from the same strain occur in the same cluster. This happens when paralogs and orthologs are clustered together, and is rather common at least in my data.
My question is: how should such clusters be treated? Do we just ignore the fact that they contain paralogs, count each cluster as one gene, and calculate the occupancy as the number of strains represented in the cluster, as usual? Or should the raw clusters be processed first to remove paralogs? If so, can you refer me to a common method? This choice will affect the number of genes in the resulting pan-genome, so it should be made carefully. However, I haven't seen any paper that addresses this issue, so I might be missing something.
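To make the first option concrete, here is a minimal Python sketch of what I mean by counting occupancy per strain while ignoring paralogs. The cluster table, strain names and gene IDs are made-up toy data, and `summarize` is just a hypothetical helper, not the output format or API of any particular tool.

```python
from collections import Counter

# Hypothetical toy clusters: cluster ID -> list of (strain, gene ID) members.
clusters = {
    "c1": [("A", "A_g1"), ("B", "B_g1"), ("C", "C_g1")],
    "c2": [("A", "A_g2"), ("A", "A_g3"), ("B", "B_g2")],  # A has two copies
    "c3": [("C", "C_g9")],
}

def summarize(members):
    """Summarize one cluster, counting each strain once for occupancy."""
    copies = Counter(strain for strain, _ in members)
    return {
        # Occupancy = number of distinct strains, regardless of copy number.
        "occupancy": len(copies),
        # Raw number of genes in the cluster, paralogs included.
        "n_genes": len(members),
        # Strains contributing more than one gene (candidate paralogs).
        "strains_with_paralogs": sorted(s for s, n in copies.items() if n > 1),
    }

summary = {cid: summarize(members) for cid, members in clusters.items()}
for cid, row in summary.items():
    print(cid, row)
```

With this approach, cluster c2 counts as a single pan-genome gene present in two strains, even though it holds three genes. That is exactly the loss of information I am unsure about.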
Would appreciate a clarification of this matter. Thank you!
Thanks for the interesting answer. Can you suggest a tool that splits paralogs out of clusters? I agree that the number of genes in a pan-genome is not particularly meaningful, but I would still argue that leaving paralogs within clusters will give somewhat "incorrect" results, since paralogs will be treated as the same gene, leading to a loss of information in the final pan-genome.
You will lose some information by doing that, yes, but whether that matters depends on what the downstream analysis is going to be. Core/accessory genome phylogenetics is the main downstream analysis I can think of where you would probably want to ensure there are no paralogues.
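For the phylogenetics case, a rough sketch of the kind of filter I have in mind: keep only clusters with exactly one gene from every strain. This is toy Python with made-up data and a hypothetical `single_copy_core` helper, not what any specific tool does internally.

```python
from collections import Counter

def single_copy_core(clusters, strains):
    """Return IDs of clusters with exactly one gene from every strain."""
    wanted = set(strains)
    keep = []
    for cid, members in clusters.items():
        copies = Counter(strain for strain, _ in members)
        # Require every strain to be present and none to have extra copies.
        if set(copies) == wanted and all(n == 1 for n in copies.values()):
            keep.append(cid)
    return keep

# Hypothetical toy input: cluster ID -> list of (strain, gene ID) members.
toy = {
    "c1": [("A", "A_g1"), ("B", "B_g1"), ("C", "C_g1")],                # kept
    "c2": [("A", "A_g2"), ("A", "A_g3"), ("B", "B_g2"), ("C", "C_g2")],  # paralog in A
    "c3": [("A", "A_g4"), ("B", "B_g4")],                               # missing C
}

core_ids = single_copy_core(toy, ["A", "B", "C"])
print(core_ids)
```

Note that this just throws away any cluster containing paralogues rather than splitting it, so it is only appropriate when you need a clean single-copy set.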
It's been a long time since I used it, but OrthoMCL might have that option, IIRC. I'm pretty sure my current go-to tool, Roary, has an option for it, but it's prokaryote-specific.
Thanks again. Indeed, it looks like Roary directly tackles this issue and tries to solve it using conserved gene neighborhoods (CGN, known in eukaryotes as gene synteny). As for OrthoMCL, it does treat paralogs and orthologs differently, but as far as I could tell, there is no obvious option to force it to allow only one gene per species (strain) in a cluster.