I am trying to retrieve, in an automated fashion and for a large number of genes, the orthologues and paralogues of a given Ensembl gene ID. Taking ENSG00000258588 as an example, I retrieve all of its orthologues and paralogues using:
use strict;
use warnings;
use Bio::EnsEMBL::Registry;

my $geneid = "ENSG00000258588"; # ENSP00000346916

# Load the registry automatically
my $registry = 'Bio::EnsEMBL::Registry';
$registry->load_registry_from_db(
    -host => 'ensembldb.ensembl.org',
    -user => 'anonymous',
);

## Get the compara gene member adaptor
my $gene_member_adaptor = $registry->get_adaptor("Multi", "compara", "GeneMember");

## Get the compara member
my $gene_member = $gene_member_adaptor->fetch_by_stable_id($geneid);

my @orthologIDs;
if (defined $gene_member) {
    ## Fetch every homology (orthologues and paralogues) involving this gene
    my $homology_adaptor = $registry->get_adaptor('Multi', 'compara', 'Homology');
    my $homologies = $homology_adaptor->fetch_all_by_Member($gene_member);
    foreach my $homology (@{$homologies}) {
        ## Collect the stable ID of every member of this homology
        foreach my $this_member (@{ $homology->get_all_Members() }) {
            push(@orthologIDs, $this_member->stable_id);
        }
    }
}
However, the problem is that I end up with far more paralogues than I want. Looking at the gene tree, I really only want ENSG00000258588 (TRIM6-TRIM34) and ENSG00000258659 (TRIM34). Essentially, I want the close paralogues but not distant ones such as TRIM22. In this particular case I could limit the paralogues to the ancestral taxonomy level Homininae, but I do not always know in advance what that level will be: it might be Rodentia or Primates, depending on when the paralogue arose. I am really only interested in events that post-date the mammalian divergence.
An alternative approach that I have thought of, but do not know how to implement, is to find the most recent common ancestor of all the orthologues and then retrieve all the Ensembl IDs under that node.
Any pointers, advice, suggestions would be most gratefully appreciated.
With taxonomy_level(), I would have to provide a long list of taxonomy names to include or exclude to make sure I get the paralogues I want. Where can I find out more about the species_tree_node() method? Thanks.
If you are only interested in the human lineage, there are not that many taxonomy levels: mammalia, theria, eutheria, boreoeutheria, euarchontoglires, primates, etc. You can get the names easily from: http://www.genomicus.biologie.ens.fr/genomicus-84.01/data/SpeciesTree.pdf
Or, to automate the process, you can use the Newick species tree: http://www.genomicus.biologie.ens.fr/genomicus-84.01/data/SpeciesTree.nwk
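As a rough illustration of automating this from a Newick file, here is a minimal pure-Perl sketch that lists the ancestors of one leaf, nearest first. The function name lineage_from_newick and the toy tree are made up for the example; the sketch assumes every internal node is labelled and that labels are unquoted, which may not hold for the real SpeciesTree.nwk file.

```perl
use strict;
use warnings;

# Hypothetical helper: return the labels of all ancestors of $leaf, nearest
# first, from a Newick string. Assumes labelled internal nodes and unquoted
# labels; branch lengths such as ":0.12" are skipped.
sub lineage_from_newick {
    my ($newick, $leaf) = @_;
    # Split into structural characters, branch lengths and labels
    my @tokens = $newick =~ /([(),;]|:[^(),;]*|[^(),;:]+)/g;
    my $depth = 0;     # number of currently open parentheses
    my $target;        # depth at which the next ancestor label will appear
    my $prev = '';
    my @lineage;
    for my $tok (@tokens) {
        $tok =~ s/^\s+|\s+$//g;
        next if $tok eq '';
        if    ($tok eq '(') { $depth++ }
        elsif ($tok eq ')') { $depth-- }
        elsif ($tok !~ /^[,;:]/) {            # a node label
            if ($prev eq ')') {               # label of an internal node
                if (defined $target && $depth == $target) {
                    push @lineage, $tok;
                    $target--;
                }
            }
            elsif (!defined $target && $tok eq $leaf) {
                $target = $depth - 1;         # its parent closes one level up
            }
        }
        $prev = $tok;
    }
    return @lineage;
}

# Toy tree for illustration only, not the real Genomicus species tree
my $toy_tree = '((Homo_sapiens:0.1,Pan_troglodytes:0.1)Homininae:0.2,'
             . '(Mus_musculus,Rattus_norvegicus)Murinae)Euarchontoglires;';
print join(' ', lineage_from_newick($toy_tree, 'Homo_sapiens')), "\n";
```

The resulting list of names could then serve as the set of taxonomy levels to accept or reject when filtering paralogues.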
I'm not sure what rule you'd like to use. The full documentation is at http://www.ensembl.org/info/docs/Doxygen/compara-api/classBio_1_1EnsEMBL_1_1Compara_1_1SpeciesTreeNode.html and the module has a lot of methods, which can do many things with tree / graph structures.
One approach is to call $species_tree_node->taxon()->classification, which returns a string like "(...) mammalia theria eutheria boreoeutheria euarchontoglires primates (...) homo". For instance, if you want ancestors below the Theria node, you can do: $classification =~ / theria / (the spaces around the name stop "eutheria" from matching). You can also call $species_tree_node->get_all_ancestors(), which returns all the nodes above the current one, and then check the name, taxon_id, etc. of each of them.
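A minimal sketch of the classification test, written as plain Perl so it runs without a database connection. The helper lineage_includes is hypothetical (it is not part of the Ensembl API) and the classification string is made up in the format quoted above. Splitting on whitespace rather than matching with surrounding spaces also catches a name at the very start or end of the string, and the comparison ignores case in case the real names are capitalised.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the Ensembl API): true if $taxon_name is one
# of the whitespace-separated entries of $classification, ignoring case.
sub lineage_includes {
    my ($classification, $taxon_name) = @_;
    my %in_lineage = map { (lc($_) => 1) } split ' ', $classification;
    return exists $in_lineage{ lc $taxon_name } ? 1 : 0;
}

# Made-up classification string; in real use it would come from
# $species_tree_node->taxon()->classification.
my $classification =
    'mammalia theria eutheria boreoeutheria euarchontoglires primates homo';

for my $level (qw(Theria Primates Rodentia)) {
    printf "%-10s %s\n", $level,
        lineage_includes($classification, $level) ? 'in lineage' : 'not in lineage';
}
```

This only handles single-word taxon names, since a space-separated classification string cannot distinguish multi-word names from separate entries.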
The best way really depends on the filtering you want to apply. You mentioned you only want the most recent paralogue. Is that correct?
Genomicus used to use a different species tree because we/they wanted to add the "Boreoeutheria" node, but since the NCBI have added it, I think the species trees are now identical.
Matthieu, Ensembl Compara and ex-Genomicus :)
Thanks, this is all very useful. I will investigate further with these leads.