Hi,
I'd like to get the lineage of certain species using Biopython, but some of the species names (the only information I have about them) contain typos. As a result, when I run:
from Bio import Entrez
handle = Entrez.esearch(db="Taxonomy", term=species_name, retmode="xml")
records = Entrez.read(handle)
records contains nothing in IdList. Does anyone know how to get, at least, the set of all species in NCBI to look for the most similar species name?
Thanks
Not sure how to get the full list of species in NCBI, but when you manage to do that, you can use the difflib library in Python to get close matching words (credit to David Robinson): https://stackoverflow.com/questions/11563615/matching-incorrectly-spelt-words-with-correct-ones-in-python?rq=1
aspell on Ubuntu is another option for spell checking.
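The difflib suggestion could look roughly like the sketch below. The list KNOWN_SPECIES is a hypothetical stand-in for whatever full list of NCBI species names you manage to obtain, and the helper name closest_species and the cutoff of 0.8 are my own assumptions, not anything prescribed by difflib or Biopython.

```python
import difflib

# Hypothetical stand-in for the full list of NCBI species names.
KNOWN_SPECIES = [
    "Homo sapiens",
    "Mus musculus",
    "Pan troglodytes",
    "Danio rerio",
]


def closest_species(name, candidates=KNOWN_SPECIES, n=3, cutoff=0.8):
    """Return up to n candidate names most similar to name, best match first.

    cutoff is the minimum similarity ratio (0 to 1) a candidate must reach;
    0.8 is an arbitrary starting point, not a recommended value.
    Matching is case-sensitive.
    """
    return difflib.get_close_matches(name, candidates, n=n, cutoff=cutoff)


if __name__ == "__main__":
    print(closest_species("Homo sapeins"))
```

If a name comes back with several candidates, or with none, you would still need to check it by hand; lowering the cutoff gives more candidates at the price of more false matches.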
It is kind of a workaround, but it could help you: you can use the API of the Ontology Lookup Service (OLS) and its search function. By restricting the search to the ncbitaxon ontology you should get good enough hits. However, you will most likely get multiple hits for your typos, so manual curation will be necessary (if you use 'exact match' you won't get any hits for your typos, which leaves you in much the same situation as before).
Docs: http://www.ebi.ac.uk/ols/docs/api
Example query: http://www.ebi.ac.uk/ols/api/search?q=homo&queryFields=label&ontology=ncbitaxon
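A minimal sketch of calling the OLS search from Python, using only the standard library. The URL and the parameters q, queryFields and ontology are taken from the example query; the function names are made up for illustration, and the shape of the JSON response (hits under "response" then "docs", each with a "label") is an assumption you should check against the OLS docs.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Base URL taken from the example query in the answer.
OLS_SEARCH_URL = "http://www.ebi.ac.uk/ols/api/search"


def build_ols_query(name, ontology="ncbitaxon"):
    """Build a search URL that matches name against term labels in one ontology."""
    params = {"q": name, "queryFields": "label", "ontology": ontology}
    return OLS_SEARCH_URL + "?" + urlencode(params)


def search_ols(name, ontology="ncbitaxon", timeout=30):
    """Send the query and return the parsed JSON response (needs network access)."""
    with urlopen(build_ols_query(name, ontology), timeout=timeout) as response:
        return json.load(response)


def candidate_labels(result):
    """Collect the labels of all hits.

    ASSUMPTION: hits are listed under result["response"]["docs"] and each
    hit carries a "label" field; verify this against the OLS API docs.
    """
    docs = result.get("response", {}).get("docs", [])
    return [doc["label"] for doc in docs if "label" in doc]


if __name__ == "__main__":
    print(build_ols_query("homo"))
```

With network access, candidate_labels(search_ols(species_name)) would give you the list of labels to curate by hand for each misspelled name.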
Lesson learned: Typos are bad. ;)