Since you want to automate this, you'll have to use the NCBI E-utilities (the NCBI will block users who script against the Entrez web interface directly). Fortunately, since you are using Biopython's Bio.Entrez (see the "Biopython Tutorial and Cookbook", Chapter 8 "Accessing NCBI's Entrez databases"), this is already taken care of.
To bulk fetch entries when you don't know their UIDs (GI numbers in this case), you first have to get the UIDs. For this you use ESearch, specifying the 'protein' database and a query to find the required entries. For example, to find the Homo sapiens (NCBI TaxId=9606) entries from RefSeq the query is:
refseq[filter] AND txid9606[Organism]
This gives a result structure which contains the UIDs. EFetch takes a comma-separated list of UIDs, so extract the UIDs, join them into such a list, and then feed this to EFetch, specifying the required format, to get the data.
The following example Python script uses BioPython to fetch the proteins from Bovine papillomavirus 7 (NCBI TaxId=1001533) present in RefSeq in fasta sequence format:
from Bio import Entrez

entrezDbName = 'protein'
ncbiTaxId = '1001533'  # Bovine papillomavirus 7
Entrez.email = 'email@example.org'  # Tell NCBI who you are.

# Find the UIDs of the entries matching the query.
entrezQuery = "refseq[filter] AND txid%s[Organism]" % (ncbiTaxId)
searchResultHandle = Entrez.esearch(db=entrezDbName, term=entrezQuery)
searchResult = Entrez.read(searchResultHandle)
searchResultHandle.close()

# Fetch the data for the matching entries in FASTA format.
uidList = ','.join(searchResult['IdList'])
entryData = Entrez.efetch(db=entrezDbName, id=uidList, rettype='fasta').read()
print(entryData)
In this case the result is small (only 7 proteins), so fetching everything in a single step is reasonable. For taxa with larger numbers of entries, you will want to retrieve the entry data in chunks rather than in one go, to avoid time-outs, to limit the load on the NCBI's servers, and to allow for checkpoints and retries in your own code. See "8.15 Using the history and WebEnv" for details of how to use the history capabilities of the E-utilities from Biopython to simplify this process.
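As a rough sketch of that pattern (the batch size, output file name and email address here are just placeholders), a history-based chunked fetch could look like:

from Bio import Entrez

Entrez.email = 'email@example.org'  # placeholder address

# Search with history so the result set is stored on the NCBI servers.
searchHandle = Entrez.esearch(db='protein',
                              term='refseq[filter] AND txid9606[Organism]',
                              usehistory='y')
searchResult = Entrez.read(searchHandle)
searchHandle.close()

count = int(searchResult['Count'])
webEnv = searchResult['WebEnv']
queryKey = searchResult['QueryKey']
batchSize = 500  # arbitrary chunk size

with open('refseq_proteins.fasta', 'w') as outFile:
    for start in range(0, count, batchSize):
        # Fetch one chunk of the stored result set.
        fetchHandle = Entrez.efetch(db='protein', rettype='fasta', retmode='text',
                                    retstart=start, retmax=batchSize,
                                    webenv=webEnv, query_key=queryKey)
        outFile.write(fetchHandle.read())
        fetchHandle.close()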
Alternatively, there are many other resources which provide the RefSeq data together with combined query and fetch capabilities. For example:
- RefSeq is available from various public SRS servers (see Public SRS Installations). The EMBL-EBI's "Linking to SRS" guide documents how to use SRS via URLs and details how to use URLs as an API to SRS. For the example above, using SRS@EBI, the query and fetch could be replaced with a call to the URL (a fetch sketch follows this list): http://srs.ebi.ac.uk/srsbin/cgi-bin/wgetz?-noSession+-view+FastaSeqs+[REFSEQP-NCBI_TaxId:1001533]
- RefSeq is available on the main MRS server, and may be available on other MRS servers.
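If the SRS@EBI service is still available, a minimal sketch of fetching the FASTA data via that URL with the Python standard library (the URL is the one from the first bullet above; the output handling is just a placeholder) might look like:

import urllib.request

# SRS wgetz URL returning the RefSeq proteins for NCBI TaxId 1001533 as FASTA
# (assumes the SRS@EBI server and the REFSEQP-NCBI_TaxId index are still available).
srsUrl = ('http://srs.ebi.ac.uk/srsbin/cgi-bin/wgetz?'
          '-noSession+-view+FastaSeqs+[REFSEQP-NCBI_TaxId:1001533]')

with urllib.request.urlopen(srsUrl) as response:
    fastaData = response.read().decode('utf-8')

print(fastaData)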
For an overview of using Python with web services, see the Python section of the EMBL-EBI's "Introduction to Web Services". This includes links to the main documentation for the various tool-kits, and tutorials for the most commonly used ones.
Thanks Hamish. I didn't know about the txid9606[Organism] construct. Now it works like a charm.
See also http://news.open-bio.org/news/2009/06/ncbi-einfo-biopython/ - you can discover other useful fields to filter on via einfo.
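For instance, a short sketch of listing the searchable fields of the protein database with Bio.Entrez (the email address is a placeholder):

from Bio import Entrez

Entrez.email = 'email@example.org'  # placeholder address

# Ask EInfo to describe the 'protein' database, including its searchable fields.
infoHandle = Entrez.einfo(db='protein')
infoRecord = Entrez.read(infoHandle)
infoHandle.close()

for field in infoRecord['DbInfo']['FieldList']:
    print('%(Name)s, %(FullName)s, %(Description)s' % field)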
Here's an example fetching viruses in GenBank format, https://github.com/peterjc/picobio/blob/master/fetch_viruses/fetch_viruses.py
Also, beware of chimeric records, http://blastedbio.blogspot.co.uk/2013/11/entrez-trouble-with-chimeras.html
I don't know if this will still get any attention, but I am running into "RuntimeError: Search Backend failed:" or "no elements found" errors. Any thoughts?
I've gotten around this by not using so many variables. It's a little less clean, but it works.
It might be best to email the NCBI Entrez team with the full details - it sounds like something on their server is failing.