Question

How to group genes in a single genome into gene families?

5.0 years ago

liorglic ★ 1.4k

Hello,
I'd like to group the genes of a specific genome (tomato in my case) into gene families. I am using sequence similarity (not domains), specifically with the software OrthoFinder.
The input is the whole proteome of the species. Interestingly, only ~18% of genes are grouped into gene families; the rest are singletons. This contrasts with most studies and databases I have seen, where most genes (sometimes ~80%) belong to gene families. In most cases, people use multiple genomes from various species rather than a single genome. I suspect this explains the low extent of clustering, since adding more genomes can create new graph edges.
My question is: what would be the right way to achieve my goal, which is just to cluster genes from a single genome? Should I:
1) Stick with what I'm doing, since this is correct from the perspective of a single genome?
2) Include proteins from other species in the analysis? This seems a bit strange and would mean that my results depend on the species I choose to include.
3) Use some strategy that assigns genes to pre-defined gene families rather than clustering from scratch?
4) Something else?
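
To make the suspicion about graph edges concrete, here is a toy sketch in Python. It is a deliberate simplification: families are taken as plain connected components of a similarity graph, which is not what OrthoFinder does internally, and the gene names (`t1` to `t4` for tomato, `x1` for a protein from another species) and edges are invented for illustration.

```python
from collections import defaultdict

def connected_components(nodes, edges):
    """Group nodes into connected components with a small union-find."""
    parent = {n: n for n in nodes}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]  # path halving
            n = parent[n]
        return n

    for a, b in edges:
        parent[find(a)] = find(b)

    groups = defaultdict(set)
    for n in nodes:
        groups[find(n)].add(n)
    return list(groups.values())

def fraction_in_families(components, focal_genes):
    """Fraction of focal genes that share a component with another focal gene."""
    focal = set(focal_genes)
    grouped = sum(len(c & focal) for c in components if len(c & focal) >= 2)
    return grouped / len(focal)

# Hypothetical data: four tomato genes and one protein from a second species.
tomato = ["t1", "t2", "t3", "t4"]
within_edges = [("t1", "t2")]              # hits found inside the tomato proteome
cross_edges = [("t3", "x1"), ("t4", "x1")]  # hits that only exist via species X

single = connected_components(tomato, within_edges)
multi = connected_components(tomato + ["x1"], within_edges + cross_edges)

print(fraction_in_families(single, tomato))
print(fraction_in_families(multi, tomato))
```

In this toy case `t3` and `t4` have no direct hit to each other, so they stay singletons until `x1` bridges them. Whether real data behaves this way depends on the clustering method and thresholds used.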

Thanks!

gene family • 1.5k views

updated 5.0 years ago by Renesh ★ 2.2k • written 5.0 years ago by liorglic ★ 1.4k

score 1 · Answer 1 · 2019-11-27

5.0 years ago

Renesh ★ 2.2k

You need to use a domain-based database (Pfam) to identify gene families. Highly conserved domains define protein function and can be used to classify protein-coding genes into gene families. Conserved signature domains can detect divergent or distantly related homologs that sequence-similarity tools such as BLAST would often miss. A domain-based search would therefore identify more genes belonging to gene families than a BLAST-based homology search.
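
As a rough sketch of the domain-based grouping described above, the Python below groups proteins by shared Pfam accession, assuming the proteome has already been searched against Pfam with `hmmscan --domtblout`. The rows, gene names, and Pfam version suffixes are made up for illustration, the `1e-5` independent E-value cutoff is an arbitrary assumption, and grouping by single domains (rather than full domain architectures) is a simplification, so multi-domain proteins can land in more than one group.

```python
from collections import defaultdict

# Made-up rows in hmmscan --domtblout layout; IDs and scores are illustrative.
DOMTBLOUT = """\
# target name accession tlen query name accession qlen E-value score bias ...
Pkinase PF00069.27 264 geneA - 400 1.2e-50 170.1 0.0 1 1 2.0e-53 1.5e-50 169.8 0.0 1 264 50 310 48 312 0.95 Protein kinase domain
Pkinase PF00069.27 264 geneB - 520 3.0e-40 140.2 0.0 1 1 5.0e-43 4.1e-40 139.9 0.0 3 262 120 380 118 382 0.93 Protein kinase domain
Pkinase PF00069.27 264 geneC - 610 7.5e-22 80.4 0.1 1 1 9.0e-25 8.8e-22 80.0 0.1 20 250 200 430 190 440 0.88 Protein kinase domain
p450 PF00067.25 463 geneD - 505 1.0e-90 300.5 0.0 1 1 1.5e-93 1.2e-90 300.2 0.0 2 460 30 495 29 498 0.97 Cytochrome P450
Pkinase PF00069.27 264 geneE - 350 0.3 5.1 0.0 1 1 0.001 0.5 4.8 0.0 10 60 100 150 95 155 0.70 Protein kinase domain
"""

def domain_families(text, max_ievalue=1e-5):
    """Map each Pfam accession to the proteins carrying it (groups of >= 2)."""
    members = defaultdict(set)
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank and comment lines
        fields = line.split()
        # hmmscan domtblout: column 2 = domain accession, column 4 = query
        # protein, column 13 = independent (per-domain) E-value.
        domain_acc, protein = fields[1], fields[3]
        i_evalue = float(fields[12])
        if i_evalue <= max_ievalue:
            members[domain_acc].add(protein)
    # A domain carried by a single protein does not form a family here.
    return {acc: genes for acc, genes in members.items() if len(genes) >= 2}

families = domain_families(DOMTBLOUT)
for acc, genes in sorted(families.items()):
    print(acc, sorted(genes))
```

Here `geneE` is dropped by the E-value filter and `geneD` stays a singleton because no other protein carries its domain.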

Read this manuscript: https://www.biorxiv.org/content/early/2019/08/28/272187.full.pdf

Web Tool: http://mandadilab.webfactional.com/home/

Thanks for the quick answer. I'm aware of the domain-based approach and will definitely try it as well. My research focuses on gene duplication and family size dynamics, so I am wondering whether a similarity-based analysis might be more informative in my case. What do you think? What would be your advice (if any) for applying the similarity approach?

5.0 years ago by liorglic ★ 1.4k

As I mentioned in my answer, you can use the similarity approach (BLAST) to identify gene families, but you will miss many genes that could be grouped into families. The similarity approach is not able to identify divergent or distantly related homologs effectively, so the domain-based approach is the best choice. To see this for yourself, you can try both approaches and compare the results. The domain-based approach should group more related genes into gene families than the similarity approach.
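
One simple way to compare the two approaches, sketched in Python, is to collect the genes each method places in a family of at least two members and look at the differences. The family IDs, gene names, and memberships below are invented, so the numbers say nothing about which method groups more genes on real data.

```python
def grouped_genes(families, min_size=2):
    """Return all genes that belong to a family with at least min_size members."""
    genes = set()
    for members in families.values():
        if len(members) >= min_size:
            genes |= set(members)
    return genes

# Hypothetical results from the two methods on the same six genes.
all_genes = {"g1", "g2", "g3", "g4", "g5", "g6"}
similarity = {"OG1": {"g1", "g2"}, "OG2": {"g3"}, "OG3": {"g4"}, "OG4": {"g5", "g6"}}
domain = {"PF_x": {"g1", "g2", "g3"}, "PF_y": {"g4", "g5"}}

sim = grouped_genes(similarity)
dom = grouped_genes(domain)

only_domain = dom - sim       # grouped by domains but singletons by similarity
only_similarity = sim - dom   # grouped by similarity but not by domains

print(f"similarity: {len(sim) / len(all_genes):.2f} of genes in families")
print(f"domain:     {len(dom) / len(all_genes):.2f} of genes in families")
print("only domain:", sorted(only_domain))
print("only similarity:", sorted(only_similarity))
```

Genes that appear in only one of the two sets are the ones worth inspecting by hand.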

5.0 years ago by Renesh ★ 2.2k