I do not have the resources to use Linux right now, nor do I know anyone who uses OrthoMCL for protein clustering.
I want to know in what format we can provide input to the program and what we get as output.
I found this sample output of the OrthoMCL program for protein clustering. I wonder what input format was used to create such a file. I would also like to know what each part of the output stands for.
This is what it looks like.
ecoli6370: col125|YP_006311412.1 col139|YP_007556103.1 col23|YP_001729413.1 col3|NP_286258.1 col4|NP_308598.1
ecoli6371: col125|YP_006312035.1 col127|YP_006770839.1 col131|YP_006779890.1 col134|YP_006785029.1 col3|NP_286985.1
Thanks @chrispin. What about paralogs? I mean, for taxa1(gene1) and taxa1(gene2), are they listed like any other ortholog, or is there some separator such as a comma or "/" between paralogs? I want to know this so that I can write a program to process this output...
Paralogs are included in the OMCL output just like any other ortholog, as you say. So an orthologous group containing paralogs would look something like this:
group1: taxa1|gene1A taxa1|gene1B taxa2|gene1 taxa3|gene1 [etc...]
where gene1A and gene1B are inferred paralogs within taxa1's genome. But be careful! When working with multiple genomes there may be thousands of orthologous groups and OMCL will not get it right all of the time. Sometimes it will cluster sequences together that should probably be split, other times the converse. So I would look at the alignment for any group you suspect contains paralogs just to check. Draft genomes throw up other potential issues too -- this paper and this paper discuss these problems a bit.
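Since you mention wanting to process this output, here is a minimal Python sketch of how one line of that form could be split up. It assumes the layout shown above (a group name, a colon, then whitespace-separated `taxon|gene` entries); the function name `parse_group_line` is just a hypothetical helper for illustration, not part of OrthoMCL.

```python
def parse_group_line(line):
    """Split one group line into (group_id, [(taxon, gene), ...]).

    Assumes the layout 'group: taxon|gene taxon|gene ...' seen in this thread.
    """
    # Everything before the first colon is the group name.
    group_id, _, members = line.partition(":")
    pairs = []
    for member in members.split():
        # Each member is 'taxon|gene'; split on the first '|' only.
        taxon, _, gene = member.partition("|")
        pairs.append((taxon, gene))
    return group_id.strip(), pairs


line = "ecoli6370: col125|YP_006311412.1 col3|NP_286258.1 col4|NP_308598.1"
group_id, pairs = parse_group_line(line)
print(group_id)
for taxon, gene in pairs:
    print(taxon, gene)
```

Reading a whole file is then a matter of calling this on each non-empty line.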
Good luck!
If a cluster contains more genes than taxa, then that cluster contains paralogs. For example, cluster ORTHOMCL0(42 genes,18 taxa) contains paralogs because it must contain at least two genes for at least one of its taxa. The program doesn't explicitly specify which taxa-gene combinations represent paralogs; it's really left to the user to decide.
If you want to write a program to parse the output, then you should check whether any taxon ID in the taxa-gene combinations, e.g. taxa1(gene2), appears more than once in a cluster (some of the genes from that taxon could be paralogs).
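As a rough sketch of that check in Python: the snippet below assumes the `taxon|gene` line layout posted elsewhere in this thread, and `taxa_with_multiple_genes` is a hypothetical helper name. It only reports taxa that occur more than once in a cluster; whether those genes are real paralogs is still for you to judge.

```python
from collections import defaultdict


def taxa_with_multiple_genes(line):
    """Return {taxon: [genes]} for taxa listed more than once in one cluster.

    Assumes the layout 'group: taxon|gene taxon|gene ...'.
    """
    _, _, members = line.partition(":")
    genes_by_taxon = defaultdict(list)
    for member in members.split():
        taxon, _, gene = member.partition("|")
        genes_by_taxon[taxon].append(gene)
    # Keep only taxa that contribute two or more genes (candidate paralogs).
    return {t: g for t, g in genes_by_taxon.items() if len(g) > 1}


example = "group1: taxa1|gene1A taxa1|gene1B taxa2|gene1 taxa3|gene1"
print(taxa_with_multiple_genes(example))
# prints {'taxa1': ['gene1A', 'gene1B']}
```

An empty result means the cluster has exactly one gene per taxon, i.e. the number of genes equals the number of taxa.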
I think @rwn raises some good points as well in his/her reply.