TL;DR: How can I convert a list of species names (common or scientific) into a corresponding list of NCBI accession IDs, so that I can download each species’ reference genome from NCBI? The NCBI URLs do not accept common or scientific names, so a regular Python web-scraping script that uses base URL + item from a list does not work. (My list is at least 200 species long, so I want to automate this.)
What I am trying to do: Download the reference genomes/transcriptomes of some 200 species. My preferred source for this is the NCBI datasets (URL: https://www.ncbi.nlm.nih.gov/datasets/genomes/). I am trying to automate this just for convenience.
What I was hoping to do to automate the task: NCBI has a software package called “datasets” that takes an accession number (the GCA or the RefSeq ID) and downloads a zipped data package containing the genome. It is fairly easy to use so long as you have the accession number. To generate the accession IDs, I thought I would write a Python web-scraping script: take a base URL, loop through a list of species names to build a list of URLs, request each URL from NCBI’s servers, and use BeautifulSoup to look for the accession IDs in the returned HTML. But alas, it turns out the NCBI servers don’t take species names in the URL but a specific taxon ID. I guess this makes sense because the same species can have many different data files associated with it. So, for example, if you want to get data from NCBI on the common mouse, you have to pass “taxon=10090” instead of “mouse” in the URL. This is a problem I have not been able to work around because I do not know the taxon IDs for my 200 species. I know mouse is mouse, but I have no way of knowing that mouse is taxon=10090. I am looking for a resource to generate these taxon IDs.
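To make the download step concrete, here is a minimal Python sketch that builds one `datasets` command per accession number. It assumes the `datasets download genome accession <ID> --filename <file>` form of the CLI; the helper names, the output naming scheme, and the `run` switch are illustrative assumptions, and nothing is executed unless `run=True`.

```python
import subprocess
from typing import List


def build_download_command(accession: str, out_dir: str = ".") -> List[str]:
    """Return the argv for downloading one genome data package (hypothetical helper)."""
    # Assumed CLI form: datasets download genome accession <ID> --filename <zip>
    return [
        "datasets", "download", "genome", "accession", accession,
        "--filename", f"{out_dir}/{accession}.zip",
    ]


def download_all(accessions: List[str], out_dir: str = ".", run: bool = False) -> List[List[str]]:
    """Build (and optionally run) one download command per accession."""
    commands = [build_download_command(acc, out_dir) for acc in accessions]
    if run:
        for cmd in commands:
            # Requires the NCBI `datasets` executable to be on PATH.
            subprocess.run(cmd, check=True)
    return commands
```

Calling `download_all(my_accessions)` with the default `run=False` only returns the commands, which is a cheap way to inspect them before downloading 200 genomes.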
What I tried: I tried to see if the Entrez ESearch utility (https://www.ncbi.nlm.nih.gov/books/NBK25501/) is the API that will help me get these IDs, but it seems I might be barking up the wrong tree. ESearch uses URL-based data retrieval: you take a base URL like eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi? and add parameters to it, such as db=genome (the database) and term=mouse (the search term), to get an XML file listing the mouse-related records in that NCBI database. The final URL for the XML file looks like this: eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=genome&term=house+mouse[orgn]. I can get an XML file for any species, but these XML files do not seem to contain accession IDs for reference genomes. I looked at this table (https://www.ncbi.nlm.nih.gov/books/NBK25497/table/chapter2.T._entrez_unique_identifiers_ui/?report=objectonly) to see which database might hold the reference genomes, without much luck. How can I solve this issue?
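The URL construction described above can be sketched in Python as follows. This is only an illustration: the function names are made up, and the XML in the usage example is a hand-written sample in the ESearch result layout, not real server output. ESearch returns database UIDs in its `IdList`, not assembly accessions, which matches the observation above; passing `db="taxonomy"` would query the taxonomy database instead, where the UIDs are taxon IDs.

```python
import xml.etree.ElementTree as ET
from typing import List, Optional
from urllib.parse import urlencode

EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(name: str, db: str = "genome") -> str:
    """Build an ESearch URL restricting the term to the organism field."""
    # urlencode percent-encodes the brackets and turns spaces into '+'.
    query = urlencode({"db": db, "term": f"{name}[orgn]"})
    return f"{EUTILS_ESEARCH}?{query}"


def parse_id_list(xml_text: str) -> List[Optional[str]]:
    """Extract the UIDs from the <IdList> of an ESearch XML response."""
    root = ET.fromstring(xml_text)
    return [elem.text for elem in root.findall("./IdList/Id")]


# Hand-written sample in the ESearch layout (an assumption, for illustration only).
sample = "<eSearchResult><Count>1</Count><IdList><Id>10090</Id></IdList></eSearchResult>"
print(build_esearch_url("house mouse"))
print(parse_id_list(sample))
```

Fetching the URL (for example with `urllib.request.urlopen`) is left out on purpose so that the sketch runs without network access.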
Any help is much appreciated!
Hi, Mirian. Yes, you understood my query right. Your answer is exactly what I was looking for, so thank you so much. You even wrote out the bash script that I was planning to write. This makes my day so much easier! I cannot thank you enough.
Hey, Mirian. I have a follow-up question regarding a couple of the Entrez Direct tools.
TL;DR: When you provide a query to Entrez Direct's esearch, is there a way to search for similar terms (spelling-wise or context-wise) in NCBI's taxonomy database?
For example,
esearch -db taxonomy -query "physcomitrella [orgn]" | esummary
gives the following response:
So, it seems that the query is case-insensitive, but is there a way to make it less sensitive to spelling (perhaps by looking for the closest spelling from the command line) or less sensitive to the exact name? I tried to look for an answer here https://www.ncbi.nlm.nih.gov/books/NBK179288/, but could not find anything close to what I was looking for.
The elink tool has this description: "Elink looks up precomputed neighbors within a database, or finds associated records in other databases", but I cannot get it to work on db=taxonomy queries. I was trying to see if I could look things up in the NCBI taxonomy database based on the closeness of names to the provided query, or perhaps submit one of the common names of a species and get back the closest possible match in the taxonomy database. Going back to the above example, "Physcomitrella" is also known as "Physcomitrium", but I can only download it under the name "Physcomitrium" from the database.
Is there a way around this?
Hi Rijan,
I'm not sure if this is exactly what you're looking for, but we have a taxon-suggest endpoint in our REST API service. It works with scientific and common names, as well as taxids. It's a case-insensitive substring search, so it won't find anything that's misspelled. If you search for physcomitr with the higher taxon option, here's the result:
Let me know if that helps, or if you have any questions. :)
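For anyone wanting to script the taxon-suggest lookup, here is a rough Python sketch of building the request URL. The base URL, the `/taxonomy/taxon_suggest/{query}` path, and the `tax_rank_filter=higher_taxon` parameter are assumptions based on the NCBI Datasets v2 REST API and should be checked against the current API documentation; the function name is hypothetical.

```python
from urllib.parse import quote, urlencode

# ASSUMPTION: base URL of the NCBI Datasets v2 REST API.
DATASETS_API = "https://api.ncbi.nlm.nih.gov/datasets/v2"


def build_taxon_suggest_url(query: str, higher_taxon: bool = False) -> str:
    """Build a taxon-suggest URL for a name fragment (hypothetical helper)."""
    # The query is part of the path, so spaces and slashes are percent-encoded.
    url = f"{DATASETS_API}/taxonomy/taxon_suggest/{quote(query, safe='')}"
    if higher_taxon:
        # ASSUMPTION: parameter name and value for the "higher taxon" option.
        url += "?" + urlencode({"tax_rank_filter": "higher_taxon"})
    return url


print(build_taxon_suggest_url("physcomitr", higher_taxon=True))
```

Because the search is a substring match, a truncated stem such as `physcomitr` can cover both "Physcomitrella" and "Physcomitrium", while a genuinely misspelled name would still need to be corrected by hand.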