How to retrieve the paralogues for a limited taxonomic group
8.6 years ago
Joseph Hughes ★ 3.0k

I am trying to retrieve, in an automated fashion and for a large number of genes, the orthologues and paralogues of a given Ensembl gene ID. Take ENSG00000258588 for example: I am retrieving all the orthologues and paralogues using

use strict;
use warnings;
use Bio::EnsEMBL::Registry;

my $geneid = "ENSG00000258588"; # ENSP00000346916

# Load the registry automatically from the public Ensembl database server
my $registry = 'Bio::EnsEMBL::Registry';
$registry->load_registry_from_db(
  -host => 'ensembldb.ensembl.org',
  -user => 'anonymous',
);

## Get the compara gene member adaptor
my $gene_member_adaptor = $registry->get_adaptor('Multi', 'compara', 'GeneMember');

## Get the compara member
my $gene_member = $gene_member_adaptor->fetch_by_stable_id($geneid);

## Collect the stable IDs of the members of every homology
## (orthologues and paralogues; the query gene is part of each pair)
my @orthologIDs;
if (defined $gene_member) {
  my $homology_adaptor = $registry->get_adaptor('Multi', 'compara', 'Homology');
  my $homologies = $homology_adaptor->fetch_all_by_Member($gene_member);
  foreach my $homology (@{$homologies}) {
    foreach my $this_member (@{ $homology->get_all_Members() }) {
      push @orthologIDs, $this_member->stable_id;
    }
  }
}

However, the problem is that I end up with far more paralogues than I want. Looking at the gene tree, I really only want ENSG00000258588 (TRIM6-TRIM34) and ENSG00000258659 (TRIM34). Essentially, I want the close paralogues but not the distant ones such as TRIM22. In this particular case I could limit the paralogues to the ancestral taxonomy Homininae, but I do not always know what the ancestral taxonomy will be: sometimes it might be Rodentia or Primates, depending on when the paralogue arose. I am really only interested in duplication events that occurred after the mammalian divergence.

An alternative approach I have thought of, but do not know how to implement, is to find the most recent common ancestor of all the orthologues and then retrieve all the Ensembl IDs below that node.

Any pointers, advice, suggestions would be most gratefully appreciated.

ensembl perl API paralogues taxonomy • 2.6k views
8.6 years ago

Hi Joseph,

The Homology object ($homology) has a taxonomy_level() method that returns the name of the LCA of the pair of genes.

There is also a species_tree_node() method which maps back to a node in the species-tree. Each node has a taxon() method that links to the NCBI-taxonomy, a name(), but you can also directly compare nodes with has_ancestor().
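As a rough illustration of filtering on the value returned by taxonomy_level(), the sketch below keeps only paralogues whose LCA is in an allow-list of clades. It runs on plain hashes rather than live Homology objects: in a real script the `type` and `lca` fields would presumably be filled from `$homology->description` and `$homology->taxonomy_level`. The helper name `keep_recent_paralogues`, the field names, and the (incomplete) clade list are assumptions for illustration, and apart from the TRIM34 entry taken from the question, the sample records are made up.

```perl
use strict;
use warnings;

# Clades on the human lineage treated as "recent enough".
# Assumed, incomplete list: extend it from the species tree of your release.
my %recent_clade = map { lc($_) => 1 }
    qw(Theria Eutheria Boreoeutheria Euarchontoglires Primates Homininae);

# Keep only paralogue records whose LCA (duplication node) is in the allow-list.
sub keep_recent_paralogues {
    my ($records, $allowed) = @_;
    my @kept = grep {
        $_->{type} =~ /paralog/ && $allowed->{ lc($_->{lca}) }
    } @$records;
    return \@kept;
}

# Hypothetical records standing in for Homology objects.
my @records = (
    { id => 'ENSG00000258659',   type => 'within_species_paralog', lca => 'Homininae' },
    { id => 'DISTANT_PARALOGUE', type => 'within_species_paralog', lca => 'Vertebrata' },
    { id => 'SOME_ORTHOLOGUE',   type => 'ortholog_one2one',       lca => 'Euarchontoglires' },
);

my $kept = keep_recent_paralogues(\@records, \%recent_clade);
print join(',', map { $_->{id} } @$kept), "\n";
```

An allow-list is used here instead of an exclusion list because the set of clades younger than a chosen cutoff on one lineage is short, whereas the set of everything else is not.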

Matthieu, Ensembl Compara


With taxonomy_level() I would have to provide a long list of taxonomy names to include or exclude in order to get only the paralogues I want. Where can I find out more about the species_tree_node() method? Thanks


If you are interested only in the human lineage, there are not that many taxonomy levels: mammalia, theria, eutheria, boreoeutheria, euarchontoglires, primates, etc. You can get the names easily from: http://www.genomicus.biologie.ens.fr/genomicus-84.01/data/SpeciesTree.pdf

Or you can use the newick species tree: http://www.genomicus.biologie.ens.fr/genomicus-84.01/data/SpeciesTree.nwk to automate the process
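A minimal sketch of how the Newick route could be automated: given a Newick string with labelled internal nodes, list the named ancestors of one leaf, nearest first. The function name `ancestors_of_leaf` is hypothetical, and the tree below is a tiny made-up example, not the content of the Genomicus file, whose label format may differ (for instance it may carry branch lengths or use other species names).

```perl
use strict;
use warnings;

# Return the labels of all internal nodes above $leaf, nearest ancestor first.
sub ancestors_of_leaf {
    my ($newick, $leaf) = @_;

    # A leaf label follows '(' or ',' and is followed by ':', ',' or ')'.
    $newick =~ /[(,]\Q$leaf\E(?=[:,)])/g or return;
    my $rest = substr($newick, pos($newick));

    my $depth = 0;    # nesting depth of sibling subtrees opened after the leaf
    my @ancestors;
    while ($rest =~ /(\()|\)([^(),:;]*)/g) {
        if (defined $1)    { $depth++ }    # a sibling subtree opens
        elsif ($depth > 0) { $depth-- }    # a sibling subtree closes
        else {                             # this ')' closes an ancestor of the leaf
            push @ancestors, $2 if length $2;
        }
    }
    return @ancestors;
}

# Toy tree for illustration only.
my $tree = '((((Homo_sapiens,Pan_troglodytes)Homininae,Macaca_mulatta)Primates,'
         . 'Mus_musculus)Euarchontoglires,Canis_familiaris)Boreoeutheria;';

print join(' > ', ancestors_of_leaf($tree, 'Homo_sapiens')), "\n";
```

The returned list can then be truncated at the chosen cutoff clade and used as the set of taxonomy levels to accept.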


I'm not sure what rule you'd like to use. The full documentation is here: http://www.ensembl.org/info/docs/Doxygen/compara-api/classBio_1_1EnsEMBL_1_1Compara_1_1SpeciesTreeNode.html. There are a lot of methods in this module, which can do many things with tree/graph structures.

One approach is to call $species_tree_node->taxon()->classification, which returns a string like "(...) mammalia theria eutheria boreoeutheria euarchontoglires primates (...) homo". For instance, if you want ancestors below the Theria node, you can do: $classification =~ / theria /. You can also call $species_tree_node->get_all_ancestors(), which returns all the nodes above the current one, and then check the name, taxon_id, etc. of each of them.
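The classification test can be wrapped in a small helper that matches whole clade names case-insensitively, so that "theria" does not accidentally match "eutheria". This is only a sketch on a plain string: the helper name `classification_includes` is hypothetical, the example string is abbreviated from the one quoted above with the elided parts left out, and multi-word taxon names would need different handling.

```perl
use strict;
use warnings;

# True (non-zero) if $clade occurs as a whole, whitespace-separated name
# in the classification string; the comparison ignores case.
sub classification_includes {
    my ($classification, $clade) = @_;
    my $wanted = lc $clade;
    return scalar grep { lc($_) eq $wanted } split ' ', $classification;
}

# Abbreviated example string (not real API output).
my $classification =
    'mammalia theria eutheria boreoeutheria euarchontoglires primates homo';

for my $clade (qw(Theria Rodentia)) {
    printf "%s: %s\n", $clade,
        classification_includes($classification, $clade) ? 'yes' : 'no';
}
```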

The best way really depends on the filtering you want to apply. You mentioned you only want the most recent paralogue. Is that correct?

Genomicus used to use a different species tree because we/they wanted to add the "Boreoeutheria" node, but since the NCBI has now added it, I think the species trees are identical.

Matthieu, Ensembl Compara and ex-Genomicus :)


Thanks, this is all very useful. I will investigate further with these leads.

8.6 years ago
abascalfederico ★ 1.2k

You can get all the paralogues and then select only those that share a last common ancestor (LCA) with the query gene at certain levels: mammalia, theria, eutheria, etc.

I don't know how the "last common ancestor" information is stored within the homology object in the API, but with BioMart it is very easy to get the full list of human genes, their paralogues, and the LCAs.
