Question

Correct formatting for IDs in OMA standalone .splice files to properly identify splice variants

1

5.7 years ago

eschang1 ▴ 10

I see that for the input .splice files, OMA standalone requires that the individual IDs are unique prefixes of your FASTA headers, and proteins that are splice variants of the same gene should be listed in the splice file like "ENSP00000384207; ENSP00000263741; ENSP00000353094".

It looks like NCBI and Ensembl have annotation tables that can be downloaded that will make associating proteins with genes fairly straightforward. In the annotation tables, proteins are usually identified by the shortest version of their name, something like NP_001027594.1.
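A minimal sketch of that grouping step, assuming the annotation table has been reduced to a tab-separated file with one gene column and one protein column. The column names `gene_id` and `protein_id` and the function name are hypothetical, and listing only genes with more than one protein is an assumption rather than something the manual is quoted as requiring:

```python
import csv
import io
from collections import defaultdict


def splice_lines_from_table(table_text, gene_col="gene_id", protein_col="protein_id"):
    """Group protein IDs by gene and return one ';'-joined line per gene.

    Column names are hypothetical; adjust them to the real annotation table.
    Only genes with at least two distinct proteins produce a line (assumption).
    """
    groups = defaultdict(list)
    reader = csv.DictReader(io.StringIO(table_text), delimiter="\t")
    for row in reader:
        protein = row[protein_col].strip()
        # Skip empty entries and avoid listing the same protein twice.
        if protein and protein not in groups[row[gene_col]]:
            groups[row[gene_col]].append(protein)
    return [";".join(proteins) for proteins in groups.values() if len(proteins) > 1]


# Made-up table: geneA has two variants, geneB has only one.
example_table = (
    "gene_id\tprotein_id\n"
    "geneA\tXP_0001.1\n"
    "geneA\tXP_0002.1\n"
    "geneB\tXP_0003.1\n"
)
for line in splice_lines_from_table(example_table):
    print(line)
```

The IDs in the example are invented placeholders, not real accessions.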

To keep it brief, will OMA be able to recognize a splice file line like "NP_001027594.1;NP_001027593.1" if the actual FASTA headers are more complicated, i.e. something like:

NP_001027594.1 homeobox transcription factor Pax1/9 [Ciona intestinalis]

NP_001027593.1 DEAD-Box Protein [Ciona intestinalis]

Is this what the manual means by "unique prefixes of FASTA headers"? Just wanted to make sure that I didn't need to reformat my FASTA headers before diving into

Secondly, does the All-vs-All step use the splice variant information? Or is it possible to run the All-vs-All once and then try the OMA orthology algorithms with and without this information?

Thank you so much! Running OMA standalone on some of my own test data sets so far has been super smooth.

Cheers, Sally

OMA orthologs orthology • 1.6k views

0

Okay great, this is all in line with what I had gathered from the manual, but I wanted to clarify before I started putting together those .splice files. Thanks so much! Cheers, Sally

5.7 years ago by eschang1 ▴ 10

score 2 · Answer 1 · 2019-03-20

Hi,

Yes, this is exactly what is meant by "unique prefixes of FASTA headers". You don't have to specify the full FASTA headers; the protein ID (or even part of it) is sufficient as long as it identifies the protein uniquely.
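If you want to double-check a .splice file before starting a run, the prefix rule can be tested with a few lines of Python. This is only a sketch and not part of OMA standalone: the function names are made up, and it assumes plain FASTA text with `>` header lines and one semicolon-separated group per splice line.

```python
def parse_fasta_headers(fasta_text):
    """Return the header lines of a FASTA text, without the leading '>'."""
    return [line[1:].strip() for line in fasta_text.splitlines() if line.startswith(">")]


def parse_splice_groups(splice_text):
    """Return one list of IDs per non-empty splice line (IDs separated by ';')."""
    groups = []
    for line in splice_text.splitlines():
        ids = [part.strip() for part in line.split(";") if part.strip()]
        if ids:
            groups.append(ids)
    return groups


def find_ambiguous_ids(headers, groups):
    """Map each ID that is not a prefix of exactly one header to its match count."""
    problems = {}
    for ids in groups:
        for protein_id in ids:
            matches = sum(1 for header in headers if header.startswith(protein_id))
            if matches != 1:
                problems[protein_id] = matches
    return problems


# Headers taken from the question; they only illustrate the ID matching.
fasta_text = (
    ">NP_001027594.1 homeobox transcription factor Pax1/9 [Ciona intestinalis]\n"
    "MSEQ\n"
    ">NP_001027593.1 DEAD-Box Protein [Ciona intestinalis]\n"
    "MSEQ\n"
)
headers = parse_fasta_headers(fasta_text)
good = find_ambiguous_ids(headers, parse_splice_groups("NP_001027594.1;NP_001027593.1"))
bad = find_ambiguous_ids(headers, parse_splice_groups("NP_00102759; NP_999999.1"))
print(good)
print(bad)
```

An empty result means every ID matches exactly one header. A count of 0 means the ID does not occur as a prefix at all, and a count above 1 means the prefix is too short to be unique.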

The All-vs-All step does not make use of the splicing information. We still compute the All-vs-All for all proteins and only select the best variant (based on the total number of homologous hits with all other genomes) in a later step. You can turn the UseOnlyOneSplicingVariant parameter on or off and compare the output that gets produced. Changing the *.splice files afterwards will not work, however: the internal database will not be updated or, if no splice file had been defined previously, invalidated.

If you notice a problem with the splicing variants after the All-vs-All, it might still be possible to update the database manually. This might become a feature in a future release of OMA standalone.

Cheers, Adrian