Dear all,
I used the reciprocal best BLAST hit (RBH) method to find orthologous proteins across many species. My problem is that some individual clusters contain too many proteins.
Here is an example. Each pair below is a reciprocal best BLAST hit (take the first line: the best hit of species1 protein A is species2 protein B, and the best hit of species2 protein B is species1 protein A). These pairs give a single cluster containing A, B, C, D, E, F and G, because all of these proteins are connected in some way. As the number of genes and species increases, some clusters become huge (thousands of proteins in one cluster). How can I filter the result so that these huge clusters are split into smaller ones?
THANK YOU!
species1 protein A <--> species2 protein B
species1 protein A <--> species3 protein C
species1 protein D <--> species4 protein E
species2 protein B <--> species3 protein F
species2 protein G <--> species4 protein E
species3 protein F <--> species4 protein E
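For anyone who wants to reproduce the example, here is a minimal Python sketch of how the six pairs above collapse into one cluster when every RBH link is treated as an edge and clusters are taken as connected components. The function name `cluster_rbh_pairs` and the `(species, protein)` tuples are my own illustration, not part of any particular tool, and the poster's actual clustering script may differ.

```python
from collections import defaultdict


def cluster_rbh_pairs(pairs):
    """Group proteins into connected components (single-linkage clusters)."""
    parent = {}

    def find(node):
        # Follow parent links to the component root, shortening the path as we go.
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    for left, right in pairs:
        root_left, root_right = find(left), find(right)
        if root_left != root_right:
            parent[root_right] = root_left  # merge the two components

    clusters = defaultdict(set)
    for node in list(parent):
        clusters[find(node)].add(node)
    return list(clusters.values())


# The six reciprocal best hits from the question, as (species, protein) tuples.
pairs = [
    (("species1", "A"), ("species2", "B")),
    (("species1", "A"), ("species3", "C")),
    (("species1", "D"), ("species4", "E")),
    (("species2", "B"), ("species3", "F")),
    (("species2", "G"), ("species4", "E")),
    (("species3", "F"), ("species4", "E")),
]

clusters = cluster_rbh_pairs(pairs)
for cluster in clusters:
    print(sorted(cluster))
```

With this kind of chaining, a single link such as F <--> E is enough to merge two otherwise separate groups, which is how clusters can grow very large as more species are added.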
If you believe that reciprocal best hits give you valid orthologs, then I see no reason to split the groups: the genes in each group are orthologs. If you suspect the method is inaccurate for some reason, then you need to build a proper phylogenetic tree. To deal with the amount of data, you could try building a tree for each cluster. You might also want to use protein-guided nucleic acid sequence alignments for this.
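As a rough sketch of the "one tree per cluster" idea, the first step could be to write each cluster to its own FASTA file, so that each file can be aligned and passed to a tree-building program separately. The function name `write_cluster_fastas`, the file naming scheme and the `sequences` dictionary (protein ID to sequence) are assumptions for illustration only. The alignment and tree steps are left to whatever tools you prefer.

```python
from pathlib import Path


def write_cluster_fastas(clusters, sequences, out_dir):
    """Write one FASTA file per cluster and return the list of file paths.

    clusters  : iterable of sets of protein IDs
    sequences : dict mapping protein ID -> sequence string (assumed to be
                loaded beforehand, e.g. from the downloaded CDS/protein files)
    out_dir   : directory that will receive cluster_00001.fasta, ...
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    written = []
    for index, cluster in enumerate(clusters, start=1):
        path = out_dir / f"cluster_{index:05d}.fasta"
        with path.open("w") as handle:
            # Sort IDs so the output is reproducible between runs.
            for protein_id in sorted(cluster):
                handle.write(f">{protein_id}\n{sequences[protein_id]}\n")
        written.append(path)
    return written
```

Each resulting file is then an independent, much smaller alignment and tree problem, and the per-cluster jobs can be run in parallel.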
Hi, Jean-Karim Heriche, Thank you for your comments.
What kind of data do you have? Transcriptome assemblies? Predicted genes from your draft genome assemblies? Genes downloaded from NCBI?
Hi h.mon, they are CDS sequences downloaded from Ensembl.
In that case, why don't you use the orthology inference from Ensembl Compara?
Thanks for your comments. Actually, I tried BioMart, but because of the large number of species the orthologous pair dataset is huge and I could not download it. That's why I turned to the RBH approach.
Use the API then. As an alternative, you can probably also find the same species in TreeFam.
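To make the API suggestion a bit more concrete, here is a hedged sketch that assumes the Ensembl REST API is meant (the Perl Compara API is another option). It assumes the `homology/id/:species/:id` endpoint on `https://rest.ensembl.org` with the `type`, `sequence` and `target_species` parameters, and a JSON response with a `data` list whose entries carry `homologies`. Please check the current REST documentation before relying on it, since endpoints and fields change between releases. The helper names are my own.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import Request, urlopen

REST_SERVER = "https://rest.ensembl.org"  # assumed public REST server


def homology_url(species, gene_id, target_species=None):
    """Build the query URL for the orthologues of one gene."""
    params = {"type": "orthologues", "sequence": "none"}
    if target_species:
        params["target_species"] = target_species
    return (f"{REST_SERVER}/homology/id/{quote(species)}/{quote(gene_id)}"
            f"?{urlencode(params)}")


def fetch_orthologues(species, gene_id, target_species=None):
    """Query the server and return the decoded JSON (needs network access)."""
    request = Request(homology_url(species, gene_id, target_species),
                      headers={"Content-Type": "application/json"})
    with urlopen(request, timeout=30) as response:
        return json.load(response)


def extract_pairs(payload):
    """Flatten a response into (source gene, target species, target gene, type)."""
    pairs = []
    for entry in payload.get("data", []):
        for homology in entry.get("homologies", []):
            target = homology.get("target", {})
            pairs.append((entry.get("id"), target.get("species"),
                          target.get("id"), homology.get("type")))
    return pairs
```

Usage would be something like `extract_pairs(fetch_orthologues("human", "ENSG00000157764"))`, one gene at a time. Querying gene by gene avoids downloading the whole orthologue table at once, but for many genes you should respect the server's rate limits.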
Thanks a lot, Jean-Karim Heriche.