How to automatically find the right species name when there are typos, using Entrez.esearch?
7.3 years ago
Juan Cordero ▴ 140

Hi,

I'd like to get the lineage of certain species using Biopython, but some of the species names (the only information I have about each species) contain typos. As a result, when I run:

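from Bio import Entrez
Entrez.email = "you@example.org"  # placeholder; NCBI asks for a contact address
handle = Entrez.esearch(db="Taxonomy", term=species_name, retmode="xml")
records = Entrez.read(handle)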

records["IdList"] comes back empty. Does anyone know how to get at least the set of all species names in NCBI, so that I can look for the most similar name?

Thanks

Biopython Entrez taxonomy • 2.1k views

I'm not sure how to get the full list of species in NCBI, but once you have it, you can use Python's difflib module to find closely matching words (credit to David Robinson): https://stackoverflow.com/questions/11563615/matching-incorrectly-spelt-words-with-correct-ones-in-python?rq=1
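As a rough sketch of the difflib idea: the helper name suggest_names and the three-name list below are made up for illustration, and the 0.8 cutoff is just a starting guess that you would probably need to tune on your own data.

```python
import difflib


def suggest_names(query, known_names, n=3, cutoff=0.8):
    """Return up to n known names that closely resemble query."""
    # Compare in lower case so that capitalisation is not counted as a typo.
    by_lower = {name.lower(): name for name in known_names}
    hits = difflib.get_close_matches(query.lower(), list(by_lower), n=n, cutoff=cutoff)
    return [by_lower[hit] for hit in hits]


# Hypothetical stand-in for the real list of NCBI species names.
known = ["Homo sapiens", "Mus musculus", "Danio rerio"]
print(suggest_names("homo sapeins", known))
```

If more than one candidate comes back, it is probably safer to review the suggestions by hand than to pick the first one automatically.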


There is also aspell on Ubuntu.


It is kind of a workaround, but it could help: you can use the API of the Ontology Lookup Service (OLS) and its search function. If you restrict the search to the ncbitaxon ontology, you should get good enough hits. However, you will most likely get multiple hits for each typo, so manual curation will be necessary. (If you use 'exact match' you won't get any hits for your typos, which leaves you in much the same situation as before.)

Docs: http://www.ebi.ac.uk/ols/docs/api
Example query: http://www.ebi.ac.uk/ols/api/search?q=homo&queryFields=label&ontology=ncbitaxon

Lesson learned: Typos are bad. ;)
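For what it's worth, the OLS query from this comment could be scripted roughly like this. The function names are made up, and the JSON layout assumed in ols_search (a "response" object holding a "docs" list whose entries carry a "label") is an assumption that should be checked against the OLS docs before relying on it.

```python
import json
import urllib.parse
import urllib.request

# Endpoint taken from the example query in the comment above.
OLS_SEARCH = "http://www.ebi.ac.uk/ols/api/search"


def ols_search_url(query, ontology="ncbitaxon", field="label"):
    """Build a search URL restricted to one ontology and one query field."""
    params = {"q": query, "queryFields": field, "ontology": ontology}
    return OLS_SEARCH + "?" + urllib.parse.urlencode(params)


def ols_search(query):
    """Return the labels of the hits for query (needs network access).

    Assumption: the reply is JSON shaped like {"response": {"docs": [...]}}.
    """
    with urllib.request.urlopen(ols_search_url(query)) as reply:
        data = json.load(reply)
    return [doc.get("label") for doc in data.get("response", {}).get("docs", [])]
```

Calling ols_search("homo") should then give a list of candidate labels to curate manually, assuming the service still answers at that address.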

7.3 years ago
piet ★ 1.9k

You can download a dump of the taxonomy database from NCBI; it is updated several times a day.

ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz

There is also a README file which explains the format of the ASCII files contained in the tarball.

ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump_readme.txt

After you have unpacked the tarball, you can grep through the file 'names.dmp':

grep -i INFLUENZA names.dmp | grep -i 'Hong.*Kong'
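If you would rather stay in Python than grep, here is a minimal sketch that pulls the scientific names out of names.dmp. It assumes the layout described in the README (fields separated by tab, pipe, tab, with the name class in the fourth column); the function name read_scientific_names and the three sample rows are made up for illustration.

```python
def read_scientific_names(lines):
    """Map each scientific name in names.dmp-style lines to its tax_id."""
    names = {}
    for line in lines:
        row = line.rstrip("\n")
        # Each row ends with a trailing "\t|" terminator; drop it before splitting.
        if row.endswith("\t|"):
            row = row[:-2]
        fields = row.split("\t|\t")
        if len(fields) < 4:
            continue  # skip anything that does not look like a names.dmp row
        tax_id, name, _unique_name, name_class = fields[:4]
        if name_class == "scientific name":
            names[name] = int(tax_id)
    return names


# Hypothetical sample rows in the names.dmp layout.
sample = [
    "9606\t|\tHomo sapiens\t|\t\t|\tscientific name\t|\n",
    "9606\t|\thuman\t|\t\t|\tgenbank common name\t|\n",
    "10090\t|\tMus musculus\t|\t\t|\tscientific name\t|\n",
]
names = read_scientific_names(sample)
print(names)
```

On the real dump you would pass an open file instead of the sample list, for example `with open("names.dmp") as fh: names = read_scientific_names(fh)`. The keys of the resulting dictionary are then the candidate list for whatever fuzzy matching you choose.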