Hi everyone,
I have predicted the orthologs between two fungi with the OrthoMCL algorithm, but when I look at the output table, many proteins from one fungus have more than one hit in the other fungus, and vice versa. How can I tell whether a protein has two orthologs in the other fungus, or only one?
In addition, the table gives a "normalized score" for each pair of predicted orthologs. Does anyone know what it means? I looked for a formula or a simple explanation, but the only thing I found is this: "Normalize ortholog and co-ortholog pairs for any two species by averaging the e-values across them, and normalize using that average" (http://www.ncbi.nlm.nih.gov/pubmed/21901743). I know it is a normalized value related to the e-value, but how is it computed? Curiously, the maximum value is 1.576, and many of the orthologs with more than one hit in the other fungus have this score too:
An02g14170 e_gw1.1.1058.1 0.241
An01g08960 e_gw1.1.1090.1 1.576
An15g05520 e_gw1.1.1090.1 1.576
The parameters that I used to find the orthologs were these:
- evalueExponentCutoff = -5 (BLAST e-value <= 1e-5; recommended value);
- percentMatchCutoff = 70;
- I (inflation factor) = 1.5 (recommended value).
Thank you so much for any help!
@Jean-Karim: Thank you for your answer, but the only explanation in this paper is "a normalized similarity score", and it refers the reader to the OrthoMCL Algorithm Document for the normalization function. I read that document, but I'm still not sure what these score values mean. Is it the formula given in the section "Find potential co-ortholog pairs": "Each CO(Ax,By) is given a pair weight: O(Ax,By) = (-log10(evalue(Ax,By)) + -log10(evalue(By,Ax))) / 2"? Also, do you know which blastp parameter I can use to see only 1:1 hits? Thanks again!
The description of the algorithm is in ref 7 of the paper you cite: Li L, Stoeckert CJ, Jr, Roos DS. OrthoMCL: identification of ortholog groups for eukaryotic genomes. Genome Research. 2003;13:2178-89.
In particular, see Fig. 2.
The raw score is as you describe above: the average of the -log10 of the e-values obtained by blastp A vs B and B vs A. This provides a measure of similarity between any two sequences. Before the MCL clustering algorithm is applied, this score is normalized by dividing it by the average weight of all pairs between the two species. For example, for two genes A and B, with A from fly and B from mouse, the raw score is (-log10(evalue(A,B)) + -log10(evalue(B,A))) / 2, and the normalized score is this value divided by the average of all raw scores between fly and mouse. You don't need or want blastp to return only one hit; you just need to take the best hit for each query sequence, which should always be the first in the list returned by blastp.
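The calculation described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the OrthoMCL code itself: the function names, the example gene IDs and e-values, and the `zero_floor` used for e-values that BLAST reports as 0 are all assumptions made for this sketch.

```python
import math


def raw_weight(evalue_ab, evalue_ba, zero_floor=1e-180):
    """Average of -log10(e-value) over both BLAST directions.

    zero_floor is an assumption of this sketch: BLAST reports 0.0 for
    very strong hits and log10(0) is undefined, so such e-values are
    clamped to a small positive number before taking the logarithm.
    """
    a = -math.log10(max(evalue_ab, zero_floor))
    b = -math.log10(max(evalue_ba, zero_floor))
    return (a + b) / 2


def normalize(pairs):
    """Normalize the raw weights of all pairs between ONE pair of species.

    pairs: list of (gene_a, gene_b, evalue_ab, evalue_ba) tuples.
    Returns a list of (gene_a, gene_b, normalized_weight) tuples.
    """
    if not pairs:
        raise ValueError("need at least one pair to compute an average")
    weights = [(a, b, raw_weight(e_ab, e_ba)) for a, b, e_ab, e_ba in pairs]
    # Average raw weight over every pair between the two species.
    mean_w = sum(w for _, _, w in weights) / len(weights)
    return [(a, b, w / mean_w) for a, b, w in weights]


# Hypothetical pairs between two species (IDs and e-values are made up).
example = [
    ("A1", "B1", 1e-100, 1e-100),  # raw weight 100
    ("A2", "B2", 1e-20, 1e-40),    # raw weight 30
    ("A3", "B3", 1e-50, 1e-50),    # raw weight 50
]

for gene_a, gene_b, score in normalize(example):
    print(gene_a, gene_b, round(score, 3))
```

With these made-up values the average raw weight is 60, so the normalized scores are about 1.667, 0.5 and 0.833. Note that in this sketch every pair whose e-value is 0 in both directions gets the same raw weight and therefore the same, largest, normalized score. That may be why many pairs in the table share the maximum of 1.576, but this has not been checked against the OrthoMCL code.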
Thank you so much for your help, Jean-Karim. This is the first time I've read a good explanation of what the normalized score of MCL is and how to calculate it.