Question

How to automatically find the right species name when there are typos, using Entrez.esearch?

0

8.1 years ago

Juan Cordero ▴ 140

Hi,

I'd like to get the lineage of certain species using Biopython, but some of the species names (the only information I have about the species) contain typos. As a result, when I run:

from Bio import Entrez
handle = Entrez.esearch(db="Taxonomy", term=species_name, retmode="xml")
records = Entrez.read(handle)
handle.close()

records["IdList"] is empty. Does anyone know how to get, at least, the full set of species names in NCBI, so that I can look for the most similar one?

Thanks

Biopython Entrez taxonomy • 2.5k views

updated 8.1 years ago by piet ★ 1.9k • written 8.1 years ago by Juan Cordero ▴ 140

2

I'm not sure how to get the full list of species in NCBI, but once you have it, you can use Python's difflib library to find closely matching names (credit to David Robinson): https://stackoverflow.com/questions/11563615/matching-incorrectly-spelt-words-with-correct-ones-in-python?rq=1
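As a rough illustration of the difflib idea, here is a minimal sketch. The helper name closest_species, the cutoff of 0.8 and the four-name list are assumptions made for the example; in practice the list would be the full set of NCBI names.

```python
import difflib

def closest_species(query, known_names, n=3, cutoff=0.8):
    """Return up to n known names that resemble the query, best match first."""
    # Compare case-insensitively, but return names in their original spelling.
    lowered = {name.lower(): name for name in known_names}
    hits = difflib.get_close_matches(query.lower(), list(lowered), n=n, cutoff=cutoff)
    return [lowered[hit] for hit in hits]

# Hypothetical stand-in for the full list of NCBI species names.
known = ["Homo sapiens", "Mus musculus", "Danio rerio", "Drosophila melanogaster"]
print(closest_species("Homo sapeins", known))
```

The cutoff probably needs tuning: a lower value tolerates worse typos but returns more false candidates, so ambiguous hits would still need a manual check.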

8.1 years ago by James Ashmore ★ 3.5k

0

And aspell on Ubuntu.

8.1 years ago by cpad0112 21k

0

It is kind of a workaround, but it could help you: you can use the API of the Ontology Lookup Service and its search function. By restricting the search to the ncbitaxon ontology you should get good enough hits. However, you will most likely get multiple hits for your typos, so manual curation will be necessary (if you used 'exact match' you would get no hits for your typos, leaving you in much the same situation as before).

Docs: http://www.ebi.ac.uk/ols/docs/api

Example query: http://www.ebi.ac.uk/ols/api/search?q=homo&queryFields=label&ontology=ncbitaxon
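The example query can also be built from Python. This is only a sketch: the function names build_ols_query and search_ols are made up for illustration, the endpoint is the one quoted in this post and may have moved since, and nothing is assumed about the layout of the JSON that comes back.

```python
import json
import urllib.parse
import urllib.request

# Endpoint as quoted in the post; it may have changed since the post was written.
OLS_SEARCH_URL = "http://www.ebi.ac.uk/ols/api/search"

def build_ols_query(term, ontology="ncbitaxon", query_fields="label"):
    """Build the search URL for a (possibly misspelt) species name."""
    params = {"q": term, "queryFields": query_fields, "ontology": ontology}
    return OLS_SEARCH_URL + "?" + urllib.parse.urlencode(params)

def search_ols(term, timeout=30):
    """Fetch the query and return the decoded JSON without interpreting it."""
    with urllib.request.urlopen(build_ols_query(term), timeout=timeout) as response:
        return json.load(response)

# Only the URL is built here; call search_ols("homo") to actually query the service.
print(build_ols_query("homo"))
```

Inspecting the result of search_ols for one known-good name first should show which fields hold the label and the taxonomy identifier.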

Lesson learned: Typos are bad. ;)

8.1 years ago by LLTommy ★ 1.2k

score 0 · Answer 1 · 2017-07-20

You can download a dump of the taxonomy database from NCBI; it is updated several times a day.

ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz

There is also a README file that explains the format of the ASCII files contained in the tarball.

ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump_readme.txt

After you have unpacked the tarball, you can grep through the file 'names.dmp':

grep -i INFLUENZA names.dmp | grep -i 'Hong.*Kong'
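
If you would rather have the names in Python than grep for them, the file can be loaded into a dictionary. A minimal sketch, assuming the layout described in the README (fields tax_id, name, unique name and name class, separated by tab-pipe-tab, each row ending in tab-pipe); the function name load_scientific_names is made up for the example:

```python
import os

def load_scientific_names(path="names.dmp"):
    """Map each scientific name in names.dmp to its taxonomy ID."""
    name_to_taxid = {}
    with open(path, encoding="utf-8") as dump:
        for line in dump:
            row = line.rstrip("\n")
            # Each row ends with a trailing "\t|" terminator; drop it.
            if row.endswith("\t|"):
                row = row[:-2]
            fields = row.split("\t|\t")
            if len(fields) < 4:
                continue  # skip rows that do not have the expected four fields
            tax_id, name, _unique_name, name_class = fields[:4]
            # Keep only scientific names; synonyms and common names are ignored.
            if name_class == "scientific name":
                name_to_taxid[name] = int(tax_id)
    return name_to_taxid

# Runs only if the unpacked dump is in the current directory.
if os.path.exists("names.dmp"):
    names = load_scientific_names("names.dmp")
    print(len(names), "scientific names loaded")
```

The keys of the returned dictionary are the set of names to compare the misspelt ones against, and the values are the taxonomy IDs to use for the lineage lookup. Dropping the name-class filter would also let synonyms and common names match.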