Hello,
I'd like to group the genes of a specific genome (tomato in my case) into gene families. I am using sequence similarity (not domain composition), specifically with the software OrthoFinder.
The input I use is the whole proteome of the species. Interestingly, only ~18% of genes end up grouped into gene families; the rest are singletons. This contrasts with most studies and databases I have seen, where most genes (sometimes ~80%) belong to gene families. In most cases, though, people use multiple genomes from various species rather than a single genome. I suspect this explains the low extent of clustering, since adding more genomes can create new graph edges that link genes which would otherwise stay unconnected.
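To illustrate the suspicion above, here is a minimal Python sketch of how a gene from a second species can bridge two tomato genes that have no direct hit to each other. It uses plain connected components (single linkage), which is a deliberate simplification and not OrthoFinder's actual clustering procedure; the gene IDs (`sl_A`, `sl_B`, `sl_C`, `at_X`) and the edges are made up for illustration.

```python
from collections import defaultdict

def cluster(nodes, edges):
    """Group nodes into connected components using union-find."""
    parent = {n: n for n in nodes}

    def find(x):
        # Walk up to the root, compressing the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in edges:
        parent[find(a)] = find(b)

    groups = defaultdict(set)
    for n in nodes:
        groups[find(n)].add(n)
    return list(groups.values())

# Hypothetical tomato genes with no significant hits among themselves.
tomato = ["sl_A", "sl_B", "sl_C"]
single_genome = cluster(tomato, edges=[])

# Adding one hypothetical gene from another species that hits both sl_A and sl_B.
multi_genome = cluster(
    tomato + ["at_X"],
    edges=[("sl_A", "at_X"), ("sl_B", "at_X")],
)

print(single_genome)  # three singletons
print(multi_genome)   # sl_A and sl_B now share a group via at_X
```

In the single-genome run every gene is a singleton, while in the two-species run `sl_A` and `sl_B` land in the same group only because of the added gene.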
My question is: what would be the right way to achieve my goal, which is just clustering genes from a single genome? Should I:
1) Stick with what I'm doing, since this is correct from the perspective of a single genome?
2) Include proteins from other species in the analysis? This seems a bit strange and would mean that my results depend on the species I choose to include.
3) Use some strategy that assigns genes to pre-defined gene families rather than clustering from scratch?
4) Something else?
Thanks!
Thanks for the quick answer. I'm aware of the domain-based approach, and I am definitely going to try it as well. My research focuses on gene duplication and family size dynamics, so I am wondering whether a similarity-based analysis might be more informative in my case. What do you think? What would be your advice (if any) for applying the similarity approach?
As I mentioned in my answer, you can use the similarity approach (BLAST) to identify gene families, but you will miss many genes that could be grouped into gene families. The similarity approach is not able to identify divergent or distantly related homologs effectively. The domain-based approach is the best choice. To see this for yourself, you can try both approaches and compare the results. You will assign more related genes to gene families with the domain-based approach than with the similarity approach.
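If you do run both approaches, one simple way to compare them is to look at which genes each one places in a family of two or more members. The sketch below is only an illustration of that comparison: it assumes you have already parsed each result into a gene-to-family dictionary (the parsing step is not shown), and the gene and family labels are made-up toy values.

```python
from collections import Counter

def genes_in_families(assignment, min_size=2):
    """Return genes whose family has at least `min_size` members."""
    sizes = Counter(assignment.values())
    return {gene for gene, fam in assignment.items() if sizes[fam] >= min_size}

# Toy gene -> family mappings (hypothetical labels, not real results).
similarity = {"g1": "OG1", "g2": "OG1", "g3": "OG2", "g4": "OG3"}
domain = {"g1": "domA", "g2": "domA", "g3": "domA", "g4": "domB"}

sim_grouped = genes_in_families(similarity)
dom_grouped = genes_in_families(domain)

print("similarity fraction:", len(sim_grouped) / len(similarity))
print("domain fraction:", len(dom_grouped) / len(domain))
print("grouped only by domain:", sorted(dom_grouped - sim_grouped))
print("grouped only by similarity:", sorted(sim_grouped - dom_grouped))
```

The two set differences are the interesting part, since they list the genes on which the two approaches disagree.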