What do the OrthoMCL input and output files look like?
10.3 years ago
Naren ▴ 1000

I do not have the resources to use Linux right now, nor do I know anyone who uses OrthoMCL for protein clustering.

I want to know in what format we can provide input to the program and what we get in output.

I found this sample output of the OrthoMCL program for protein clustering. I wonder what input format was used to create such a file, and I would also like to know what each part of the output stands for.

This is what it looks like.

ecoli6370: col125|YP_006311412.1 col139|YP_007556103.1 col23|YP_001729413.1 col3|NP_286258.1 col4|NP_308598.1  
ecoli6371: col125|YP_006312035.1 col127|YP_006770839.1 col131|YP_006779890.1 col134|YP_006785029.1 col3|NP_286985.1

clustering-algorithms orthomcl • 9.2k views
10.3 years ago
rwn ▴ 610

OrthoMCL is an analysis pipeline that uses Markov clustering (MCL) on the results of an all-versus-all BLASTp to infer homologous (i.e., both orthologous and paralogous) relationships among a set of protein sequences. Thus the raw input for an OrthoMCL analysis is a set of protein sequences in FASTA format.

A pre-processing step of the OMCL pipeline is to adjust the FASTA headers of your input sequences so that they follow the form "genome_id|protein_id". In your sample, 'col125' is an E. coli strain identifier and YP_006311412.1 is a GenBank protein accession number.
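To illustrate the header adjustment described above, here is a minimal Python sketch. It is not the pipeline's own pre-processing tool; the function name `adjust_fasta_headers` is hypothetical, and it assumes that the protein id is the first whitespace-delimited token of each original header, which may not hold for every FASTA file.

```python
def adjust_fasta_headers(lines, genome_id):
    """Yield FASTA lines with headers rewritten as '>genome_id|protein_id'."""
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            # Assumption: the protein id is the first token of the header.
            fields = line[1:].split()
            protein_id = fields[0] if fields else ""
            yield f">{genome_id}|{protein_id}"
        else:
            # Sequence lines are passed through unchanged.
            yield line


example = [
    ">YP_006311412.1 hypothetical protein",
    "MKTAYIAKQR",
]
for out in adjust_fasta_headers(example, "col125"):
    print(out)
```

Each genome would be processed with its own `genome_id`, so that every protein in the combined set carries the identifier of the genome it came from.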

OMCL outputs a plain-text flat file in which the proteins on each line are inferred to be homologues of each other. A group may contain multiple sequences from the same genome; these would be paralogues (assuming there are no errors in the program's inference). The term before the colon, "ecoli6370", is simply a group identifier. The proteins "col125|YP_006311412.1 col139|YP_007556103.1 col23|YP_001729413.1 col3|NP_286258.1 col4|NP_308598.1" are inferred to be members of the same orthologous group.
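If you want to read this file programmatically, one line can be split as sketched below. This is only an illustration based on the layout of the sample line quoted in the question ("group_id: genome|protein ..."); the function name `parse_group_line` is hypothetical.

```python
def parse_group_line(line):
    """Split 'group_id: genome|protein ...' into an id and (genome, protein) pairs."""
    group_id, _, rest = line.partition(":")
    members = []
    for token in rest.split():
        # Each member is written as genome_id|protein_id.
        genome_id, _, protein_id = token.partition("|")
        members.append((genome_id, protein_id))
    return group_id.strip(), members


line = ("ecoli6370: col125|YP_006311412.1 col139|YP_007556103.1 "
        "col23|YP_001729413.1 col3|NP_286258.1 col4|NP_308598.1")
group_id, members = parse_group_line(line)
print(group_id, len(members))
```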

I'm not sure whether OMCL can be run on non-Linux platforms. It also requires setting up a relational database such as MySQL or Oracle. You may want to look at another recently developed program called "get_homologs" (paper here). I haven't used it, but it employs some other methods as well as OMCL to infer evolutionary relationships among a set of sequences.

10.3 years ago

There are different modes of running OrthoMCL. I usually prefer mode 3 (--mode 3). In this mode I have to provide an all-vs-all BLAST output file (produced with -m 8) and a .gg file, which lists the name of each isolate followed by a colon and the genes it contains. This is the same layout as the lines quoted above, i.e. ecoli6370 would be the isolate id, and col125|YP_006311412.1, col139|YP_007556103.1, etc. would be the ids of the genes it contains.

The output from OrthoMCL is simple: each line represents a cluster of orthologous genes. The number of genes in the cluster and the number of isolates (taxa) that carry it are shown in brackets after the cluster id. The two numbers may differ because of paralogs.

Here is the sample output from OrthoMCL.

ORTHOMCL0(42 genes,18 taxa):     taxa1(gene1) taxa2(gene2) taxa3(gene3)...
ORTHOMCL1(15 genes,15 taxa):     taxa1(gene1) taxa2(gene2) taxa3(gene3)...
...

and so on.
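A line in this layout could be parsed roughly as follows. This is a sketch that assumes the file looks exactly like the sample above, including the order of the two fields in each `taxa(gene)` token, so check it against a real output file before relying on it. The names `parse_cluster_line`, `HEADER_RE` and `MEMBER_RE` are my own.

```python
import re

# Header such as 'ORTHOMCL0(42 genes,18 taxa):' followed by the member list.
HEADER_RE = re.compile(r"^(\S+?)\((\d+) genes,\s*(\d+) taxa\):\s*(.*)$")
# One member token such as 'taxa1(gene1)'.
MEMBER_RE = re.compile(r"([^\s()]+)\(([^()]+)\)")


def parse_cluster_line(line):
    """Return (cluster_id, n_genes, n_taxa, [(taxon, gene), ...])."""
    match = HEADER_RE.match(line.strip())
    if match is None:
        raise ValueError(f"not a cluster line: {line!r}")
    cluster_id, n_genes, n_taxa, rest = match.groups()
    members = MEMBER_RE.findall(rest)
    return cluster_id, int(n_genes), int(n_taxa), members


line = "ORTHOMCL1(3 genes,2 taxa):\t taxa1(gene1) taxa1(gene2) taxa2(gene3)"
print(parse_cluster_line(line))
```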

OrthoMCL can be downloaded from https://code.google.com/p/ortholytics/source/browse/orthomcl.pl and MCL can be downloaded from http://micans.org/mcl/


Thanks @chrispin. What about paralogs? I mean, for taxa1(gene1) taxa1(gene2), are they listed like any other ortholog, or is there some marker such as a comma or "/" between paralogs? I want to know this so that I can write a program to process this output.


Paralogs are included in the OMCL output just like any other ortholog, as you say. So an orthologous group containing paralogs would look something like this:

group1: taxa1|gene1A taxa1|gene1B taxa2|gene1 taxa3|gene1 [etc...]

where gene1A and gene1B are inferred paralogs within taxa1's genome. But be careful! When working with multiple genomes there may be thousands of orthologous groups and OMCL will not get it right all of the time. Sometimes it will cluster sequences together that should probably be split, other times the converse. So I would look at the alignment for any group you suspect contains paralogs just to check. Draft genomes throw up other potential issues too -- this paper and this paper discuss these problems a bit.

Good luck!


If a cluster has more genes than taxa, then it contains paralogs. For example, the cluster ORTHOMCL0(42 genes,18 taxa) contains paralogs because at least one taxon in it has two or more genes. The program doesn't explicitly mark which taxa-gene combinations are paralogs; that is left to the user to decide.

If you want to write a program to parse the output, check whether any taxon id in the taxa-gene combinations of a cluster, e.g. taxa1(gene2), appears more than once (the genes of that taxon in the cluster could be paralogs).
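The check described above can be sketched in a few lines of Python. It assumes the cluster has already been parsed into (taxon, gene) pairs; the function name `taxa_with_multiple_genes` is hypothetical. A taxon that appears more than once only flags candidate paralogs, it does not confirm them.

```python
from collections import Counter


def taxa_with_multiple_genes(members):
    """Return {taxon: gene_count} for taxa that appear more than once in a cluster."""
    counts = Counter(taxon for taxon, _gene in members)
    return {taxon: n for taxon, n in counts.items() if n > 1}


members = [
    ("taxa1", "gene1A"),
    ("taxa1", "gene1B"),
    ("taxa2", "gene1"),
    ("taxa3", "gene1"),
]
print(taxa_with_multiple_genes(members))  # {'taxa1': 2}
```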

I think @rwn raises some good points as well in his/her reply.
