orthologous proteins clustering
9.3 years ago
biolab ★ 1.4k

Dear all,

I used the reciprocal best BLAST hit (RBH) method to find orthologous proteins across many species. My problem is that some individual clusters contain too many proteins.

Here is an example. Each pair below is a reciprocal best BLAST hit (taking the first line as an example: the best hit of species1 protein A is species2 protein B, and the best hit of species2 protein B is species1 protein A). The resulting cluster contains A, B, C, D, E, F and G, because all of these proteins are connected in some way. As the number of genes and species increases, some clusters become huge (thousands of proteins in a single cluster). How can I filter the result so that these huge clusters are split into smaller ones?

THANK YOU!

species1 protein A <--> species2 protein B
species1 protein A <--> species3 protein C
species1 protein D <--> species4 protein E
species2 protein B <--> species3 protein F
species2 protein G <--> species4 protein E
species3 protein F <--> species4 protein E
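
To make the chaining effect concrete, here is a minimal Python sketch that treats each reciprocal best hit as an edge and groups proteins by connected components. It assumes this single-linkage grouping is how the clusters were built; the function name `cluster_rbh` and the `(species, protein)` tuple layout are illustrative choices, not anything from the original pipeline.

```python
from collections import defaultdict

# Reciprocal best hit pairs from the example above, written as (species, protein).
rbh_pairs = [
    (("species1", "A"), ("species2", "B")),
    (("species1", "A"), ("species3", "C")),
    (("species1", "D"), ("species4", "E")),
    (("species2", "B"), ("species3", "F")),
    (("species2", "G"), ("species4", "E")),
    (("species3", "F"), ("species4", "E")),
]


def cluster_rbh(pairs):
    """Group proteins into connected components using union-find."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps lookups short
            x = parent[x]
        return x

    # Every RBH pair merges the two components its proteins belong to.
    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_a] = root_b

    clusters = defaultdict(set)
    for node in list(parent):
        clusters[find(node)].add(node)
    return list(clusters.values())


clusters = cluster_rbh(rbh_pairs)
for cluster in clusters:
    print(sorted(protein for _, protein in cluster))
```

With the six pairs above, every protein ends up in one component, even though A and D (both from species1) are never each other's best hit. A single edge is enough to fuse two groups, which is how clusters can snowball as more species are added.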

If you believe that reciprocal best hits give you valid orthologs, then I see no reason to split the groups: the genes in each group are orthologs. If you suspect the method is inaccurate for some reason, then you need to build a proper phylogenetic tree. To deal with the amount of data, you could try building a tree for each cluster. You might also want to use protein-guided nucleic acid sequence alignments for this.
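
One possible way to decide which clusters to take to tree building first (a heuristic assumed here, not something proposed in this thread) is to flag clusters in which a species contributes more than one protein, since two proteins from the same species cannot be orthologs of each other. The helper name `species_with_multiple_members` and the `(species, protein)` tuples are hypothetical.

```python
from collections import Counter


def species_with_multiple_members(cluster):
    """Return the species that contribute more than one protein to a cluster."""
    counts = Counter(species for species, _ in cluster)
    return sorted(species for species, n in counts.items() if n > 1)


# The seven-protein cluster from the question, as (species, protein) tuples.
example_cluster = {
    ("species1", "A"), ("species1", "D"),
    ("species2", "B"), ("species2", "G"),
    ("species3", "C"), ("species3", "F"),
    ("species4", "E"),
}

duplicated = species_with_multiple_members(example_cluster)
print(duplicated)
```

For the example cluster this reports species1, species2 and species3, so it would be a candidate for a per-cluster tree; a cluster with at most one protein per species returns an empty list.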


Hi Jean-Karim Heriche, thank you for your comments.


What kind of data do you have? Transcriptome assemblies? Predicted genes from your draft genome assemblies? Genes downloaded from NCBI?


Hi h.mon, they are CDS sequences downloaded from Ensembl.


In that case, why don't you use the orthology inference from Ensembl Compara?


Thanks for your comments. I did try BioMart, but because of the large number of species the orthologous pair dataset is huge, and I cannot download it. That's why I turned to the RBH approach.


Use the API then. As an alternative, you can probably also find the same species in TreeFam.


Thanks a lot, Jean-Karim Heriche.
