Question

How To Correctly Classify Gene Families?

2

12.3 years ago

GR ▴ 400

Dear Everyone,

I have around 200 genes from one species, and I want to study the evolutionary forces (possibly positive selection) acting on these genes and their orthologs across 20 sequenced genomes.

As a first step, I ran an all-vs-all BLAST of these 20 species and tried to classify the genes into gene families, from which I can get their orthologs and paralogs. I used two different programs for this. 1) OrthoMCL, which is based on graph-based heuristics and gives nicely categorized output in the form of orthologs, inparalogs and co-orthologs. I first ran OrthoMCL with default parameters (e-value 1e-5 and percent identity 50%). The tool seems to work fine for genes in large families but not for genes in small families. I then moved the OrthoMCL parameters up and down, but this had little effect: genes from the small families were still split across different families. Please note that when I talk about small and large gene families and family size, I am referring to known, well-characterized gene families in my genome.
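For readers who want to see what such cutoffs do to the raw hits, here is a minimal Python sketch that applies an e-value and a percent-identity cutoff to BLAST tabular output (`-outfmt 6`). It is not how OrthoMCL itself processes hits; the function name `filter_hits` and the example rows are made up for illustration, and the defaults simply mirror the thresholds quoted above.

```python
import csv
import io

# Standard column order of BLAST tabular output (-outfmt 6).
COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
           "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def filter_hits(handle, max_evalue=1e-5, min_identity=50.0):
    """Yield (query, subject, evalue, pident) for hits passing both cutoffs.

    Hypothetical helper: thresholds default to the values quoted in the post.
    """
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        hit = dict(zip(COLUMNS, row))
        if hit["qseqid"] == hit["sseqid"]:
            continue  # self-hits carry no clustering information
        evalue = float(hit["evalue"])
        pident = float(hit["pident"])
        if evalue <= max_evalue and pident >= min_identity:
            yield hit["qseqid"], hit["sseqid"], evalue, pident


# Made-up rows: one clear hit, one strong e-value with low identity, one self-hit.
example = (
    "geneA\tgeneB\t62.5\t200\t75\t0\t1\t200\t1\t200\t1e-80\t290\n"
    "geneA\tgeneC\t34.0\t180\t119\t0\t1\t180\t5\t184\t1e-110\t350\n"
    "geneA\tgeneA\t100.0\t200\t0\t0\t1\t200\t1\t200\t0.0\t400\n"
)
kept = list(filter_hits(io.StringIO(example)))
print(kept)
```

With these cutoffs only the geneA/geneB hit survives; the geneA/geneC pair is dropped on identity alone despite its very low e-value, which is one way a single global threshold can behave differently across families.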

So I decided to try another tool. 2) SiLiX, which simply clusters the genes into families based on similarity links across a network. But this does not seem to work well for the large families, as it keeps pulling in more and more domains: e.g. for a family that actually has 74 genes, SiLiX reports 5000 genes. OrthoMCL gave correct results in these cases.
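The over-merging described here is typical of single-linkage clustering in general. The sketch below is a generic connected-components clustering in Python, not SiLiX's actual implementation (which applies its own identity and overlap filters); the gene names are hypothetical. It shows how a single multi-domain protein can chain two unrelated families together.

```python
def single_linkage(pairs):
    """Cluster genes as connected components of the similarity graph."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b  # merge the two components

    clusters = {}
    for gene in list(parent):
        clusters.setdefault(find(gene), set()).add(gene)
    # Largest clusters first, members in alphabetical order.
    return sorted((sorted(c) for c in clusters.values()),
                  key=lambda c: (-len(c), c))


# Two hypothetical families with no similarity to each other.
pairs = [("kinase1", "kinase2"), ("tf1", "tf2")]
before = single_linkage(pairs)

# A hypothetical multi-domain protein hits one member of each family.
bridged = pairs + [("kinase2", "multi1"), ("multi1", "tf1")]
after = single_linkage(bridged)

print(before)
print(after)
```

Without the bridging protein there are two clusters of two genes; with it, all five genes end up in one cluster. Repeated across many shared domains, this is how a 74-gene family could grow to thousands of members.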

Here is a paper showing that different gene families require different program parameters for correct resolution. It gives a strategy for classifying gene families in newly sequenced species using information from known gene families in model species. But using their approach for my work has two drawbacks: first, it is not feasible to do this for 200 genes. Second, it does not address the case where a gene family has not been classified well before, and not all gene families are well characterized.

http://www.plosone.org/article/info:doi/10.1371/journal.pone.0013409

Can someone help me by suggesting the best approach to classify my genes into gene families? As different gene families require different parameters, will it work if I apply one stringency criterion to all the genes?

I hope my question is clear. Kindly let me know if I have misunderstood something about selection studies.

Thanks, R.

selection classification • 9.3k views

ADD COMMENT • link updated 6.7 years ago by rajdeepjaswal52 • 0 • written 12.3 years ago by GR ▴ 400

4

RT, I am the first author of the comparative gene family classification paper you cite (Frech and Chen, 2010). Reading your question and the comments below, my impression is that you are confusing ortholog groups with gene families. OrthoMCL is used to detect the former and not the latter, which explains why OrthoMCL "tends to go a little too fine-grained".

Let's take your example from below. You say: "I have one gene family that is very well characterized and it has 16 members in A. thaliana genome. This family has been validated by several groups so I am very confident about this. When I analyzed this family in my orthomcl results then orthomcl has classified 12 genes in one family, 3 in one family and one gene in a separate family. We can say based on this that orthomcl is too stringent with default parameters."

The reason why OrthoMCL splits this gene family of 16 genes into three ortholog groups (not gene families!) is most likely not that your parameters are too stringent, but that this gene family indeed splits into three ortholog groups in your data set! This basically means that some members of your gene family have different orthologs than others. This is expected to happen for many gene families, depending on the relatedness of your species.
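To check how a known family is distributed over ortholog groups, a quick tally like the following Python sketch can help. The gene identifiers and group names (`OG1` to `OG3`) are made up, and the function `family_distribution` is a hypothetical helper, not part of OrthoMCL.

```python
from collections import Counter


def family_distribution(family, groups):
    """Count how many members of `family` fall into each group."""
    gene_to_group = {gene: name
                     for name, members in groups.items()
                     for gene in members}
    return Counter(gene_to_group.get(gene, "unassigned") for gene in family)


# Hypothetical 16-member family, split 12 / 3 / 1 as in the example above.
family = [f"At_fam{i:02d}" for i in range(1, 17)]
groups = {
    "OG1": set(family[:12]) | {"Os_x1", "Zm_x1"},   # plus orthologs from other species
    "OG2": set(family[12:15]) | {"Os_x2"},
    "OG3": {family[15], "Zm_x3"},
}

dist = family_distribution(family, groups)
print(dist.most_common())
```

If each group also contains genes from the other species, as in this toy input, the 12/3/1 split reflects three distinct sets of orthologs rather than a failed clustering.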

You say that you do not have well classified reference gene families for all your gene families of interest, so comparative gene family classification is probably not the way to go. What I would do in this case is to run the program TribeMCL (not OrthoMCL) with different inflation values and see how well your gene families get resolved. Then, for each gene family, pick the TribeMCL cluster that resolved this gene family best.
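A minimal Python sketch of the "pick the best cluster per family" step, under two assumptions that are mine and not from the paper: that you have at least a partial list of expected members for the family, and that the Jaccard index is an acceptable measure of how well a cluster resolves it. The inflation values and gene names are hypothetical, and the clusterings are assumed to have been parsed from the clustering output beforehand.

```python
def jaccard(a, b):
    """Jaccard index of two collections of gene identifiers."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0


def best_cluster(family, clusterings):
    """Return (inflation, cluster, score) for the cluster matching `family` best.

    `clusterings` maps an inflation value to a list of clusters (sets of genes).
    """
    best = (None, None, -1.0)
    for inflation, clusters in sorted(clusterings.items()):
        for cluster in clusters:
            score = jaccard(family, cluster)
            if score > best[2]:
                best = (inflation, set(cluster), score)
    return best


family = {"g1", "g2", "g3", "g4"}
clusterings = {
    1.4: [{"g1", "g2", "g3", "g4", "x1", "x2"}, {"y1"}],      # too coarse
    2.0: [{"g1", "g2", "g3", "g4"}, {"x1", "x2"}, {"y1"}],    # exact match
    4.0: [{"g1", "g2"}, {"g3", "g4"}, {"x1", "x2"}, {"y1"}],  # too fine
}

best = best_cluster(family, clusterings)
print(best)
```

Running this once per family of interest gives a per-family choice of inflation value instead of one global setting. For families without any reference members, the score has to be replaced by manual inspection.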

ADD REPLY • link 12.2 years ago by Christian ★ 3.1k

0

I found the implementation of TribeMCL in SCPS easier to set up and install, mostly because I couldn't find working software for TribeMCL anywhere when I went searching (the link in the original paper is defunct, IIRC). I actually found the approach you recommend in the cited paper very useful in my work.

ADD REPLY • link 12.2 years ago by DG 7.3k

score 1 · Answer 1 · 2012-08-20

1

12.3 years ago

Pawel Szczesny 3.2k

I don't know how non-standard your species are, but instead of trying to classify all 200 genes from scratch, I would first assign the obvious cases to COG or KO (KEGG Orthology) groups (this is independent of family size). The number of genes that need to be analyzed will drop substantially, leaving you with a small number of genes to assess manually. Personally, I use CLANS (http://bioinfoserver.rsbs.anu.edu.au/programs/clans/) to analyze families that are not well characterized. Based on visual inspection, I decide what threshold to use when considering a particular protein a member of the family.

ADD COMMENT • link 12.3 years ago by Pawel Szczesny 3.2k

1

There is also the PANTHER families database, which can be used to assign genes to families/sub-families, as well as the pre-defined OrthoMCL families (although I find they tend to go a little too fine-grained due to inparalogs), OMA, and HomoloGene, all of which may be useful.

ADD REPLY • link 12.3 years ago by DG 7.3k

0

Dan, I am not very clear about your statement on OrthoMCL families: 'I find they tend to go a little too fine-grained due to inparalogs'. How can inparalogs make them fine-grained? It would be helpful if you could explain this. I ran OrthoMCL myself and had the same experience, but I want to know why, as it is a widely used tool.

I observed a few problems with OrthoMCL, which I would like to describe here in case you or someone else can explain them. I have one gene family that is very well characterized and has 16 members in the A. thaliana genome. This family has been validated by several groups, so I am very confident about it. When I looked at this family in my OrthoMCL results, OrthoMCL had classified 12 genes in one family, 3 in another, and one gene in a separate family. Based on this, one could say that OrthoMCL is too stringent with default parameters, so I relaxed the stringency criteria, but surprisingly, trying multiple parameter settings did not affect the family (WEIRD to me). I then checked my all-vs-all BLAST results for the e-value and % identity of the genes that were clustered into different families. Two genes placed in different families had an e-value < 1e-110 and 34% identity, while my stringency criteria were e-value < 1e-3 and 25% identity.

I observed the same pattern with many families. Has the OrthoMCL algorithm failed for my dataset, or is there some other problem? I stopped using OrthoMCL after this but am still curious. Any clue?

Sorry for the long query. But I really need suggestions on this.

ADD REPLY • link 12.2 years ago by GR ▴ 400

1

OrthoMCL is quite judicious (and correct) in splitting off tight clusters of inparalogs as their own OrthoMCL groups; that is what it is designed to do, after all. However, sometimes when you are classifying sequences you would also like to know that a small family of, say, A. thaliana genes, all inparalogs, is related to other clusters, and which clusters those are. This matters mostly because when you do BLAST-based searches of the clusters with your sequences, a sequence may have its best hit against one of these small clusters. This can be the case when working with poorly characterized eukaryotic groups, where there are lots of related taxa full of paralogous sequences. I tend to see this a lot with ciliates, trypanosomes, Giardia, etc.

ADD REPLY • link 12.3 years ago by DG 7.3k

0

Dear Both,

Thanks a lot for your help. It has given me a direction to proceed.

Only a few of my species are standard species covered by these databases (they are all plants, so maybe I can use the Ensembl database for this). I am planning to pick the families from one of the databases. Since I tested SiLiX and OrthoMCL with multiple parameters, I will take whichever parameter setting gives correct results for my standard species and use it to define the gene family in all 20 species. I am sure this will reduce the number of gene families to be analyzed. What do you think, guys?

ADD REPLY • link 12.3 years ago by GR ▴ 400

0

It sounds reasonable to use a combination approach like that. You may also want to check out SCPS (http://www.paccanarolab.org/software/scps/index.html) for clustering. It lets you do MCL clustering at various cut-offs (like TribeMCL), spectral clustering, connected components analysis, and hierarchical clustering all in one package, with all-vs-all BLAST results as the input. But for your case, using pre-defined families first to assign genes to families may be the best bet.

ADD REPLY • link 12.3 years ago by DG 7.3k

0

Thanks a lot Dan for all your help.

I was confused by the OrthoMCL output, as everyone recommended this tool and it is the most widely used tool in many labs here. I could not explain why it does not work well for my dataset. Many thanks for all your suggestions and prompt responses. This was really very helpful :)

ADD REPLY • link 12.3 years ago by GR ▴ 400

score 0 · Answer 2 · 2018-03-23

Hey everyone, I am working on genome-wide identification of one gene family in fungal species. Using comparative genomics, I identified members of this family in various species, and now I want to classify the newly found members. I annotated these proteins using NCBI CD-Search as well as InterPro and Pfam. After that, I built a multiple sequence alignment of these proteins with MAFFT and constructed a tree with MEGA 7.

Everything looks fine and matches previously reported results, except for the members from one species, which follow a different pattern. For those members, the phylogenetic clustering does not match the annotation results: members that share the same domain cluster with different members. Since this is a superfamily, I would expect subfamily members to form clusters within the tree. Even the alignment does not give a consistent picture. I am not sure whether I should rely on the annotation results, the phylogeny, or the alignment. Thank you.