I see that for the input .splice files, OMA standalone requires that the individual IDs are unique prefixes of your FASTA headers, and proteins that are splice variants of the same gene should be listed in the splice file like "ENSP00000384207; ENSP00000263741; ENSP00000353094".
It looks like NCBI and Ensembl have annotation tables that can be downloaded that will make associating proteins with genes fairly straightforward. In the annotation tables, proteins are usually identified by the shortest version of their name, something like NP_001027594.1.
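For what it's worth, here is a rough sketch of how I imagine turning such an annotation table into splice file lines. It assumes a tab-separated table with one row per protein; the column names `gene_id` and `protein_id`, the gene label `geneA`/`geneB`, and the ID `NP_EXAMPLE.1` are made up for illustration and will differ in real NCBI or Ensembl downloads. It also assumes that only genes with more than one protein need a line, and that `;` is the separator, as in the example quoted above.

```python
import csv
import io
from collections import defaultdict

def splice_lines(table_text, gene_col="gene_id", protein_col="protein_id"):
    """Group protein IDs by gene; return one ';'-joined line per multi-protein gene."""
    groups = defaultdict(list)
    reader = csv.DictReader(io.StringIO(table_text), delimiter="\t")
    for row in reader:
        gene, protein = row[gene_col], row[protein_col]
        # Skip rows without a protein ID and avoid listing a protein twice.
        if protein and protein not in groups[gene]:
            groups[gene].append(protein)
    # Genes with a single protein have no splice variants to list (assumption).
    return [";".join(ids) for ids in groups.values() if len(ids) > 1]

# Hypothetical table: column names and gene labels are placeholders.
EXAMPLE_TABLE = (
    "gene_id\tprotein_id\n"
    "geneA\tNP_001027594.1\n"
    "geneA\tNP_001027593.1\n"
    "geneB\tNP_EXAMPLE.1\n"
)

if __name__ == "__main__":
    for line in splice_lines(EXAMPLE_TABLE):
        print(line)
```

The output lines could then be written to the species' .splice file, one gene per line.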
To keep it brief, will OMA be able to recognize a splice file line like "NP_001027594.1;NP_001027593.1" if the actual FASTA headers are more complicated, i.e. something like:
NP_001027594.1 homeobox transcription factor Pax1/9 [Ciona intestinalis]
NP_001027593.1 DEAD-Box Protein [Ciona intestinalis]
Is this what the manual means by "unique prefixes of FASTA headers"? Just wanted to make sure that I didn't need to reformat my FASTA headers before diving into
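To make my reading of "unique prefix" concrete, here is a small check I could run before starting. It is only my interpretation, not OMA's actual matching code: an ID counts as a unique prefix if exactly one FASTA header starts with it. The function names are my own.

```python
def prefix_matches(ids, headers):
    """Map each ID to the list of headers that start with it."""
    # A leading '>' is stripped so raw FASTA header lines can be passed in directly.
    return {i: [h for h in headers if h.lstrip(">").startswith(i)] for i in ids}

def non_unique(ids, headers):
    """Return only the IDs that match zero headers or more than one header."""
    return {i: m for i, m in prefix_matches(ids, headers).items() if len(m) != 1}

# Headers taken from the example in this post.
HEADERS = [
    "NP_001027594.1 homeobox transcription factor Pax1/9 [Ciona intestinalis]",
    "NP_001027593.1 DEAD-Box Protein [Ciona intestinalis]",
]
IDS = ["NP_001027594.1", "NP_001027593.1"]

if __name__ == "__main__":
    # An empty dict means every ID is the start of exactly one header.
    print(non_unique(IDS, HEADERS))
```

Under this reading, a shortened ID such as `NP_00102759` would be flagged, because both headers start with it.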
And secondly, does the All vs All step use the splice variant information? Or is it possible to run the All vs All first and then try the OMA orthology algorithms with and without this information?
Thank you so much! Running OMA standalone on some of my own test data sets so far has been super smooth.
Cheers, Sally
Okay great, this is all in line with what I had gathered from the manual, but I wanted to clarify before I started putting together those .splice files. Thanks so much! Cheers, Sally