2.3 years ago
ngarber
From within Python, I want to be able to query BioMart to return a list containing information about genes and their homologs:
- Source species - Stable Protein ID
- Source species - Gene name
- Source species - Protein sequence
- Target species - Stable Protein ID of homolog
- Target species - Gene name of homolog
- Target species - Protein sequence of homolog
For example, say I were to input:
dataset = "Ensembl Genes 107"
target_species_dataset = "Elephant genes (Loxafr3.0)"
homolog_query = "Human"
How do I feed that into BioMart so that it spits out the six parameters I listed earlier?
Thanks so much in advance if anyone can help!
There are Python interfaces for the BioMart API:
https://pypi.org/project/biomart/
https://pypi.org/project/pybiomart/
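As a rough sketch of how the pybiomart route might look for the human/elephant example in the question: the dataset name `hsapiens_gene_ensembl`, the `lafricana` prefix, and the homolog attribute names below are my assumptions about Ensembl's BioMart naming, so check them against the attribute list of your release before relying on them. `homolog_attributes` and `fetch_homologs` are hypothetical helper names, not part of pybiomart.

```python
def homolog_attributes(target_prefix):
    """Build a BioMart attribute list for one target species.

    The attribute names are assumed to follow Ensembl's usual
    '<species>_homolog_*' pattern; verify them for your release.
    """
    return [
        "ensembl_peptide_id",                             # source protein stable ID
        "external_gene_name",                             # source gene name
        f"{target_prefix}_homolog_ensembl_peptide",       # homolog protein stable ID
        f"{target_prefix}_homolog_associated_gene_name",  # homolog gene name
    ]


def fetch_homologs(source_dataset="hsapiens_gene_ensembl",
                   target_prefix="lafricana",
                   host="http://www.ensembl.org"):
    """Query BioMart and return a pandas DataFrame of homolog pairs."""
    # Imported here so the helper above can be used without pybiomart installed.
    from pybiomart import Dataset

    dataset = Dataset(name=source_dataset, host=host)
    return dataset.query(
        attributes=homolog_attributes(target_prefix),
        use_attr_names=True,  # keep machine-readable column names
    )
```

Calling `fetch_homologs()` needs network access and returns four of the six columns asked for. As far as I know, BioMart serves sequences from a separate attribute page, so the two protein sequences would need a second query (or the REST API) keyed on the protein IDs.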
On top of Arup Ghosh's answer, you can also consider using the files available on the Ensembl FTP site:
Say we want all orthologous gene pairs between Human and Cow from the default Vertebrate ncRNA-trees. We could download the entire set of default Vertebrate ncRNA-trees homologies in one TSV file. For Ensembl 107 this would be located at:
This is a pretty massive file (3.2 GB), but if we filter it to keep only the rows in which the 'homology_type' is an orthology (i.e. 'ortholog_one2one', 'ortholog_one2many' or 'ortholog_many2many'), while 'species' and 'homology_species' are 'homo_sapiens' and 'bos_taurus' (or vice versa), we will get a reasonably sized file of Human-Cow orthologues.
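The filtering step could be sketched like this with only the standard library, streaming the file row by row so the 3.2 GB never has to fit in memory. It assumes the file has already been downloaded and decompressed to a plain tab-separated file with a header row containing the three column names mentioned above; `is_wanted_pair` and `filter_homologies` are hypothetical helper names.

```python
import csv

ORTHOLOGY_TYPES = {"ortholog_one2one", "ortholog_one2many", "ortholog_many2many"}


def is_wanted_pair(row, species_a="homo_sapiens", species_b="bos_taurus"):
    """Return True for orthology rows linking the two species, in either order."""
    if row["homology_type"] not in ORTHOLOGY_TYPES:
        return False
    return {row["species"], row["homology_species"]} == {species_a, species_b}


def filter_homologies(in_path, out_path,
                      species_a="homo_sapiens", species_b="bos_taurus"):
    """Copy matching rows from in_path to out_path; return how many were kept."""
    kept = 0
    with open(in_path, newline="") as src, open(out_path, "w", newline="") as dst:
        reader = csv.DictReader(src, delimiter="\t")
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames, delimiter="\t")
        writer.writeheader()
        for row in reader:
            if is_wanted_pair(row, species_a, species_b):
                writer.writerow(row)
                kept += 1
    return kept
```

For the elephant case in the question you would pass `species_b="loxodonta_africana"`, assuming that is how the species is spelled in the file.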
You can also use the language-agnostic Ensembl REST API to retrieve orthologue data programmatically via the homology endpoints, e.g. http://rest.ensembl.org/documentation/info/homology_ensemblgene
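A minimal sketch of calling the homology endpoint from Python with the standard library. The URL layout (`/homology/id/<species>/<gene_id>`), the query parameters, and the JSON field names (`data`, `homologies`, `source`, `target`, `protein_id`, `align_seq`) are my assumptions about the endpoint, so compare them with the documentation page linked above. All three function names are hypothetical.

```python
import json
import urllib.parse
import urllib.request


def build_homology_url(gene_id, species="homo_sapiens",
                       target_species="loxodonta_africana",
                       server="https://rest.ensembl.org"):
    """Build the request URL (path layout and parameter names are assumed)."""
    params = urllib.parse.urlencode({
        "content-type": "application/json",
        "type": "orthologues",
        "target_species": target_species,
        "sequence": "protein",
    })
    return f"{server}/homology/id/{species}/{gene_id}?{params}"


def extract_pairs(payload):
    """Flatten the (assumed) response structure into one dict per homology."""
    pairs = []
    for entry in payload.get("data", []):
        for homology in entry.get("homologies", []):
            source = homology.get("source", {})
            target = homology.get("target", {})
            pairs.append({
                "type": homology.get("type"),
                "source_protein_id": source.get("protein_id"),
                "source_sequence": source.get("align_seq"),
                "target_protein_id": target.get("protein_id"),
                "target_sequence": target.get("align_seq"),
            })
    return pairs


def fetch_orthologues(gene_id, **kwargs):
    """Fetch and flatten orthologues for one gene ID (needs network access)."""
    with urllib.request.urlopen(build_homology_url(gene_id, **kwargs)) as response:
        return extract_pairs(json.load(response))
```

This works one gene at a time, so for a genome-wide list the FTP file or BioMart is probably the better fit. Gene names are not included in this sketch and would need a separate lookup.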
Both tools are outdated and poorly documented, with few examples of what input they expect or how to process the output.