Dear Everyone,
I have around 200 genes from one species, and I want to study the evolutionary forces acting on them (possibly positive selection) across their orthologs in 20 sequenced genomes.
As a first step, I ran an all-vs-all BLAST of these 20 species and tried to classify the genes into gene families, from which I can get the orthologs and paralogs of my genes. I used two different programs for this. (1) OrthoMCL, which is based on graph-based heuristics and gives nicely categorized output in the form of orthologs, inparalogs and co-orthologs. I first ran OrthoMCL with default parameters (e-value 1e-5 and percent identity 50%). It seems to work fine for genes from large families but not for genes from small families. I then moved the OrthoMCL parameters up and down, but this had little effect: genes from the small families were still split across different families. Please note that when I talk about small and large gene families and family size, I mean known, well-characterized gene families in my genome.
So I decided to try another tool, (2) SiLiX, which simply clusters the genes into families by following similarity links across a network. But this does not seem to work well for the large families, because it keeps picking up more and more domains. For example, for a family that actually has 74 genes, SiLiX reports 5000 genes. OrthoMCL gave correct results in this case.
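To illustrate why a linked-network approach can balloon a family like this, here is a minimal sketch of plain single-linkage clustering over pairwise hits. It is a toy model, not SiLiX's actual implementation, and all gene names (`famA_1`, `bridge1`, and so on) are hypothetical. One multi-domain gene that hits members of two otherwise unrelated families is enough to merge them into a single cluster.

```python
from collections import defaultdict

def single_linkage_clusters(hits):
    """Group genes into connected components of the similarity graph.

    hits: iterable of (gene_a, gene_b) pairs that passed the similarity cutoff.
    Returns a list of sets, largest cluster first.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in hits:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two components

    clusters = defaultdict(set)
    for gene in list(parent):
        clusters[find(gene)].add(gene)
    return sorted(clusters.values(), key=len, reverse=True)

# Hypothetical hits: family A and family B never hit each other directly.
within_family_hits = [
    ("famA_1", "famA_2"), ("famA_2", "famA_3"),
    ("famB_1", "famB_2"),
]
# A multi-domain gene shares one domain with each family.
bridge_hits = [("famA_3", "bridge1"), ("bridge1", "famB_1")]

separate = single_linkage_clusters(within_family_hits)
merged = single_linkage_clusters(within_family_hits + bridge_hits)
print(len(separate), "clusters without the bridge gene")
print(len(merged), "cluster(s) with the bridge gene")
```

Stricter filters on alignment coverage are the usual way to cut such bridging links, but the right threshold is likely to differ between families.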
Here is a paper showing that different gene families require different program parameters for correct resolution. It gives a strategy for classifying gene families in newly sequenced species using information from known gene families in model species. But using their approach for my work has two drawbacks. First, it is not feasible to do this for 200 genes. Second, it does not address the case where a gene family has not been classified well before, and not all gene families are well characterized.
http://www.plosone.org/article/info:doi/10.1371/journal.pone.0013409
Can someone help me by suggesting the best approach to classify my genes into gene families? Since different gene families require different parameters, will it work if I apply a single stringency criterion to all the genes?
I hope my question is clear. Kindly let me know if I have misunderstood something about selection studies.
Thanks, R.
RT, I am first author of the comparative gene family classification paper you cite (Frech and Chen, 2010). Reading your question and the comments below, my impression is that you confuse ortholog groups with gene families. OrthoMCL is used to detect the former and not the latter, which explains why OrthoMCL "tends to go a little too fine-grained".
Let's take your example from below. You say: "I have one gene family that is very well characterized and it has 16 members in A. thaliana genome. This family has been validated by several groups so I am very confident about this. When I analyzed this family in my orthomcl results then orthomcl has classified 12 genes in one family, 3 in one family and one gene in a separate family. We can say based on this that orthomcl is too stringent with default parameters."
The reason why OrthoMCL splits this gene family of 16 genes into three ortholog groups (not gene families!) is most likely not that your parameters are too stringent, but that this gene family really does split into three ortholog groups in your data set. This basically means that some members of your gene family have different orthologs than others. This is expected to happen for many gene families, depending on how closely related your species are.
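A quick way to see how a known family is distributed over ortholog groups is to count its members per group. The sketch below assumes you have already loaded a gene-to-group mapping from your OrthoMCL output into a dictionary. The gene and group IDs (`gene1`, `OG_a`, and so on) are hypothetical and only mirror the 12/3/1 split from the example.

```python
from collections import Counter

def family_split(family_genes, group_of):
    """Count how many members of a known family fall into each ortholog group.

    family_genes: iterable of gene IDs in the reference family.
    group_of: dict mapping gene ID -> ortholog group ID.
    Genes missing from the mapping are counted as "unassigned".
    """
    return Counter(group_of.get(gene, "unassigned") for gene in family_genes)

# Hypothetical 16-member family split 12 / 3 / 1 across three groups.
family = [f"gene{i}" for i in range(1, 17)]
group_of = {}
group_of.update({g: "OG_a" for g in family[:12]})
group_of.update({g: "OG_b" for g in family[12:15]})
group_of.update({g: "OG_c" for g in family[15:]})

split = family_split(family, group_of)
for group, count in split.most_common():
    print(group, count)
```

If the members in each group also share the same set of orthologs in the other species, the split probably reflects real ortholog structure and not overly strict parameters.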
You say that you do not have well classified reference gene families for all your gene families of interest, so comparative gene family classification is probably not the way to go. What I would do in this case is to run the program TribeMCL (not OrthoMCL) with different inflation values and see how well your gene families get resolved. Then, for each gene family, pick the TribeMCL cluster that resolved this gene family best.
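The selection step above could be sketched as follows. This assumes the clusters from each run have already been parsed into sets of gene IDs keyed by inflation value. "Resolved best" is scored here with the Jaccard index, which is my own choice for illustration and not something prescribed by TribeMCL. All gene IDs and inflation values are hypothetical.

```python
def jaccard(a, b):
    """Jaccard index of two sets: |intersection| / |union|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def best_cluster(family, clusterings):
    """Find the cluster that best matches a known gene family.

    family: set of gene IDs in the reference family.
    clusterings: dict mapping inflation value -> list of clusters (sets).
    Returns (score, inflation, cluster); ties keep the lowest inflation.
    """
    best = (0.0, None, set())
    for inflation, clusters in sorted(clusterings.items()):
        for cluster in clusters:
            score = jaccard(family, cluster)
            if score > best[0]:
                best = (score, inflation, cluster)
    return best

# Hypothetical family and three hypothetical clustering runs.
family = {"a", "b", "c", "d"}
clusterings = {
    1.5: [{"a", "b", "c", "d", "x", "y"}, {"z"}],  # too coarse
    3.0: [{"a", "b", "c", "d"}, {"x", "y"}],      # matches the family
    5.0: [{"a", "b"}, {"c", "d"}],                # too fine
}

score, inflation, cluster = best_cluster(family, clusterings)
print(f"best inflation {inflation}, Jaccard {score:.2f}, genes {sorted(cluster)}")
```

For families without a reference, the inflation values that work for related, well-characterized families may be a reasonable starting point, but that is a judgment call.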
I found the implementation of TribeMCL in SCPS easier to install and set up (mostly because I couldn't find working software for TribeMCL anywhere when I went searching; the link in the original paper is defunct, IIRC). I actually found the approach you recommend in the cited paper very useful in my work.