I have been using OMA standalone to assign orthologs between vertebrate genomes like so:
I essentially have one mammal and three birds and am trying to find 1:1 orthologs, because TOGA was reporting too many missing orthologs due to intronic and intergenic divergence.
Thanks for your interest in OMA standalone. Here are my thoughts on your questions.
It is generally not a problem to use mitochondrial proteins, but OMA does not handle them in a specific way.
I don't understand your question exactly. Are you asking whether you can place your proteins into existing gene families and extract the orthologous/paralogous relations from them? Or rather whether you can run OMA standalone with incomplete proteomes? The latter would be fine (you can export the genomes of interest from https://omabrowser.org/export and run OMA standalone). If you want to place your proteins inside existing Hierarchical Orthologous Groups (HOGs), you could use the Fastmapping tool on the OMA Browser (https://omabrowser.org/oma/fastmapping/), where you can upload your sequences and have them mapped into existing HOGs, or simply to the closest sequence in the database.
If you mean using the export functionality of the OMA Browser, this will mainly speed up computations, as the expensive All-vs-All computations among the exported genomes are already done. Also, adding more species usually improves orthology detection, as they bring more resolution to the gene families.
Having different ID formats is ok for OMA standalone. The output files will simply use the fasta header line as IDs.
A small remark about how you started the OMA standalone jobs:
would be better. Only the All-vs-All part should be run in parallel. If you don't use a scheduler but run on a single machine, you can also drop the -W 7000 argument, as you don't want your jobs to stop after ~2 hours.
Let us know if you have more specific questions. Cheers Adrian
Re: I don't understand your question exactly. Are you asking whether you can place your proteins into existing gene families and extract the orthologous/paralogous relations from them? Or rather whether you can run OMA standalone with incomplete proteomes? The latter would be fine (you can export the genomes of interest from https://omabrowser.org/export and run OMA standalone). If you want to place your proteins inside existing Hierarchical Orthologous Groups (HOGs), you could use the Fastmapping tool on the OMA Browser (https://omabrowser.org/oma/fastmapping/), where you can upload your sequences and have them mapped into existing HOGs, or simply to the closest sequence in the database.
I was referring to this line from the documentation:
"Additionally it is possible to export the precomputed all-against-all for any of the >2000 genomes currently in the oma database."
Thank you!!
One more question:
I'm thinking I will keep only the longest transcript for each gene based on the GTF and then extract the protein sequences. I assume this preprocessing is required? I have the following formats, and the mouse faa seems to have proteins for multiple transcripts of the same gene.
For example.
Mouse
Chicken
Yes, you need to provide protein sequences, so you will need to translate your transcripts before running OMA standalone.
OMA standalone is able to work with several splicing variants per gene. But you need to provide an additional <genome>.splice file (next to the <genome>.fa file), which lists all the splicing variants per gene on one line. OMA Standalone will then select the evolutionary best conserved variant. However, this requires quite a bit of additional computing time.
Thank you!! That answers my question.
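For the simpler route of keeping only the longest protein per gene, a minimal Python sketch could look like the one below. It is not part of OMA standalone. It assumes that the CDS rows of the GTF carry both a gene_id and a protein_id attribute (check your own annotation, as not every GTF does) and that the first whitespace-delimited token of each FASTA header equals that protein_id. All function names are made up for illustration.

```python
import re


def parse_fasta(text):
    """Return {protein_id: sequence}; the ID is the first token of the header."""
    records = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            current = line[1:].split()[0]
            records[current] = []
        elif current is not None:
            records[current].append(line)
    return {pid: "".join(parts) for pid, parts in records.items()}


def protein_to_gene_from_gtf(gtf_text):
    """Map protein_id -> gene_id using CDS rows (assumed attribute names)."""
    mapping = {}
    for line in gtf_text.splitlines():
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 9 or fields[2] != "CDS":
            continue
        # GTF attributes look like: key "value"; key "value";
        attrs = dict(re.findall(r'(\S+) "([^"]*)"', fields[8]))
        if "gene_id" in attrs and "protein_id" in attrs:
            mapping[attrs["protein_id"]] = attrs["gene_id"]
    return mapping


def longest_per_gene(proteins, protein_to_gene):
    """Return {gene_id: protein_id} keeping the longest protein per gene.

    Ties keep the protein that appears first in the FASTA; proteins
    without a gene assignment are skipped.
    """
    best = {}
    for pid, seq in proteins.items():
        gene = protein_to_gene.get(pid)
        if gene is None:
            continue
        if gene not in best or len(seq) > len(proteins[best[gene]]):
            best[gene] = pid
    return best
```

Note that "longest" is only a heuristic. Supplying a .splice file instead lets OMA standalone pick the best conserved variant itself, at the cost of extra computing time.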
What I've done is, for each gene ID, list all the unique transcript IDs like so
Does this look like the format you're looking for?
A slightly different format is required. Simply add all the fasta headers on the same line, separated by a ; character: (assuming your fasta header looks
Thank you so much, this answers my questions
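As a rough illustration of the .splice layout described in this thread (one gene per line, with the fasta headers of its variants separated by ;), here is a small Python sketch. The helper names and the header-to-gene mapping are hypothetical, and the headers must match those in the corresponding <genome>.fa file exactly. The sketch writes every gene, including genes with a single variant; whether such lines may be omitted is not covered above, so check the OMA standalone documentation.

```python
from collections import defaultdict
from pathlib import Path


def group_headers_by_gene(header_to_gene):
    """Turn {fasta_header: gene_id} into {gene_id: [fasta_header, ...]}."""
    groups = defaultdict(list)
    for header, gene in header_to_gene.items():
        groups[gene].append(header)
    return dict(groups)


def splice_lines(groups):
    """One line per gene: all variant headers joined by ';'."""
    lines = []
    for gene in sorted(groups):
        headers = groups[gene]
        for header in headers:
            # A ';' inside a header would be ambiguous with the separator.
            if ";" in header:
                raise ValueError(f"header contains ';': {header!r}")
        lines.append(";".join(headers))
    return lines


def write_splice_file(db_dir, genome, groups):
    """Write <genome>.splice into db_dir, next to <genome>.fa."""
    path = Path(db_dir) / f"{genome}.splice"
    path.write_text("\n".join(splice_lines(groups)) + "\n")
    return path
```

For example, write_splice_file("DB", "mouse", groups) would create DB/mouse.splice, where "DB" and "mouse" stand in for your own database directory and genome file name.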