2.6 years ago
From within Python, I want to be able to query BioMart to return a list containing information about genes and their homologs:
- Source species - Stable Protein ID
- Source species - Gene name
- Source species - Protein sequence
- Target species - Stable Protein ID of homolog
- Target species - Gene name of homolog
- Target species - Protein sequence of homolog
For example, say I were to input:
dataset = "Ensembl Genes 107"
target_species_dataset = "Elephant genes (Loxafr3.0)"
homolog_query = "Human"
How do I feed that into BioMart so that it spits out the six parameters I listed earlier?
Thanks so much in advance if anyone can help!
Python interface for Biomart API.
On top of Arup Ghosh's answer, you can also consider using the files available on the Ensembl FTP site:
Say we want all orthologous gene pairs between Human and Cow from the default Vertebrate ncRNA-trees. We could download the entire set of default Vertebrate ncRNA-trees homologies in one TSV file. For Ensembl 107 this would be located at:
This is a pretty massive file (3.2 GB), but if we filter it to keep only the rows in which the 'homology_type' is an orthology (i.e. 'ortholog_one2one', 'ortholog_one2many' or 'ortholog_many2many') and in which 'species' and 'homology_species' are 'homo_sapiens' and 'bos_taurus' (or vice versa), we will get a reasonably sized file of Human-Cow orthologues.
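The filtering step described above could be sketched in Python roughly as follows. This is only a minimal sketch: it assumes the download is a tab-separated file with a header row containing the 'homology_type', 'species' and 'homology_species' columns named above, and the function name `filter_orthologues` is made up for illustration. It reads line by line, so the 3.2 GB file never has to fit in memory.

```python
import csv

# Orthology types named in the answer above.
ORTHOLOGY_TYPES = {"ortholog_one2one", "ortholog_one2many", "ortholog_many2many"}


def filter_orthologues(lines, species_a="homo_sapiens", species_b="bos_taurus"):
    """Yield rows (as dicts) that are orthologies between the two species.

    `lines` is any iterable of text lines, e.g. an open file handle.
    Assumes a tab-separated layout with a header row (an assumption about
    the file format, not something checked here).
    """
    wanted = {species_a, species_b}
    reader = csv.DictReader(lines, delimiter="\t")
    for row in reader:
        if row["homology_type"] not in ORTHOLOGY_TYPES:
            continue
        # Comparing sets accepts the pair in either direction.
        if {row["species"], row["homology_species"]} == wanted:
            yield row
```

To use it, open the downloaded file in text mode (with `gzip.open(path, "rt")` if it is compressed) and pass the handle to `filter_orthologues`, then write the yielded rows out with `csv.DictWriter`.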
You can also use the language-agnostic Ensembl REST API to retrieve orthologue data programmatically via the homology endpoints, for example: http://rest.ensembl.org/documentation/info/homology_ensemblgene
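A call to the homology endpoint from Python might look something like the sketch below. Treat the details as assumptions to check against the documentation page linked above: the URL path (its form has differed between REST releases), the 'target_species' and 'type' query parameters, and the JSON layout ('data', 'homologies', 'source', 'target', 'protein_id') are written from memory of the API, and the function names are made up for illustration. Gene names and unaligned sequences may need separate lookups.

```python
import json
import urllib.parse
import urllib.request


def parse_homologies(payload):
    """Flatten a homology response (already decoded from JSON) into a list of dicts.

    Assumes the layout {"data": [{"homologies": [{"type", "source", "target"}]}]}.
    """
    pairs = []
    for entry in payload.get("data", []):
        for homology in entry.get("homologies", []):
            source = homology.get("source", {})
            target = homology.get("target", {})
            pairs.append({
                "type": homology.get("type"),
                "source_species": source.get("species"),
                "source_protein_id": source.get("protein_id"),
                "target_species": target.get("species"),
                "target_protein_id": target.get("protein_id"),
            })
    return pairs


def fetch_homologies(gene_id, species="homo_sapiens",
                     target_species="loxodonta_africana",
                     server="https://rest.ensembl.org"):
    """Request orthologues of one gene and return the flattened pairs.

    The path below is an assumption; adjust it to match the current docs.
    """
    query = urllib.parse.urlencode({
        "target_species": target_species,
        "type": "orthologues",
        "content-type": "application/json",
    })
    url = f"{server}/homology/id/{species}/{gene_id}?{query}"
    with urllib.request.urlopen(url) as response:
        return parse_homologies(json.load(response))
```

Calling `fetch_homologies` with a human Ensembl gene ID should then return one dict per human-elephant orthologue pair, provided the assumptions above hold.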
Both tools are outdated and badly documented, with few examples of what input they expect or how to process the output.