Refseq Proteins For A Given Taxid
12.8 years ago
Chris ★ 1.6k

Hi,

I've got the following problem:

Given an NCBI taxid, I'd like to bulk-download all RefSeq protein sequences for that species. The FTP server seems to provide FASTA files for select species such as human. However, the majority seem to be concatenated in huge FASTA files organised by vertebrate, invertebrate, ... Of course I could download all of those and parse them for the taxid, but given their vast size this seems infeasible to me.

I am able to do this using Entrez via the NCBI homepage. However, I need to do it programmatically, preferably using Python/BioPython.

I've already found a way to retrieve _single_ sequences using a RefSeq accession, but it is very slow when iterating over thousands of accessions:

from Bio import Entrez
#acc: some RefSeq accession, ver: its version
rec = Entrez.read(Entrez.esearch(db="protein", term="%s.%s"%(acc,ver) ))
fasta = Entrez.efetch(db="protein", id=rec["IdList"][0], rettype="fasta").read()

Is there a similar way to bulk-retrieve all sequences for a given taxid?

Thanks,
Chris

protein biopython refseq • 12k views
12.8 years ago
Hamish ★ 3.3k

Since you want to automate this, you'll have to use the NCBI E-utilities (the NCBI will blacklist users who script against the Entrez web interface). Fortunately, since you are using BioPython's Bio.Entrez (see "Chapter 8 Accessing NCBI’s Entrez databases" in the Biopython Tutorial and Cookbook), this is already taken care of.

To bulk fetch entries whose UIDs (GI numbers in this case) you don't know, you first have to get the UIDs. For this you use ESearch, specifying the 'protein' database and a query to find the required entries. For example, to find the Homo sapiens (NCBI TaxId=9606) entries from RefSeq the query is:

refseq[filter] AND txid9606[Organism]

This gives a result structure which contains the UIDs. EFetch takes a comma-separated list of UIDs, so extract the UIDs, construct the list, and then feed it to EFetch, specifying the required format to get the data.

The following example Python script uses BioPython to fetch the proteins from Bovine papillomavirus 7 (NCBI TaxId=1001533) present in RefSeq in FASTA sequence format:

from Bio import Entrez

entrezDbName = 'protein'
ncbiTaxId = '1001533' # Bovine papillomavirus 7
Entrez.email = 'email@example.org' # replace with your own address

# Find entries matching the query.
# Note: ESearch returns at most 20 UIDs unless retmax is raised.
entrezQuery = "refseq[filter] AND txid%s[Organism]" % ncbiTaxId
searchResultHandle = Entrez.esearch(db=entrezDbName, term=entrezQuery)
searchResult = Entrez.read(searchResultHandle)
searchResultHandle.close()

# Get the data as plain-text FASTA.
uidList = ','.join(searchResult['IdList'])
fetchHandle = Entrez.efetch(db=entrezDbName, id=uidList, rettype='fasta', retmode='text')
entryData = fetchHandle.read()
fetchHandle.close()
print(entryData)

In this case the result is small, only 7 proteins, so a single-step fetch is reasonable. For taxa with larger numbers of entries, you will want to retrieve the entry data in chunks rather than in one go, to avoid issues with time-outs, to limit the load on the NCBI's servers and to allow for checkpoints and retries in your own code. See "8.15 Using the history and WebEnv" for details of how to use the history capabilities of E-utilities from BioPython to simplify this process.
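As a rough sketch of the chunked approach described above, something like the following could work. It assumes Biopython is installed; the function names, the batch size of 500 and the output path are illustrative choices of mine, not part of the original script. The search result is kept on the NCBI history server (usehistory="y") and the records are then pulled in slices using retstart and retmax.

```python
def build_refseq_query(taxid):
    """Return the Entrez query for RefSeq proteins of one taxon."""
    return "refseq[filter] AND txid%s[Organism]" % taxid


def batch_starts(total, batch_size):
    """Return the retstart offsets needed to cover `total` records."""
    return list(range(0, total, batch_size))


def fetch_refseq_proteins(taxid, email, out_path, batch_size=500):
    """Write all RefSeq proteins for `taxid` to a FASTA file; return the count.

    Hypothetical helper: the name and default batch size are assumptions.
    """
    # Imported here so the two helpers above can be used without Biopython.
    from Bio import Entrez

    Entrez.email = email

    # usehistory="y" asks NCBI to keep the result set on its history server,
    # so we never have to pass a long list of UIDs ourselves.
    handle = Entrez.esearch(db="protein", term=build_refseq_query(taxid),
                            usehistory="y")
    result = Entrez.read(handle)
    handle.close()
    total = int(result["Count"])

    with open(out_path, "w") as out:
        for start in batch_starts(total, batch_size):
            # Each call fetches one slice of the stored result set.
            fetch = Entrez.efetch(db="protein", rettype="fasta", retmode="text",
                                  retstart=start, retmax=batch_size,
                                  webenv=result["WebEnv"],
                                  query_key=result["QueryKey"])
            out.write(fetch.read())
            fetch.close()
    return total
```

A call such as fetch_refseq_proteins('9606', 'you@example.org', 'human_refseq.fasta') would then download the human set batch by batch. Wrapping each efetch call in your own retry logic is left out here to keep the sketch short.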

Alternatively there are many other resources which provide the RefSeq data, and provide combined query and fetch capabilities. For example:

  1. RefSeq is available from various public SRS servers (see Public SRS Installations). The EMBL-EBI's Linking to SRS guide documents how to use SRS via URLs as an API. Using SRS@EBI, the example above could be replaced with a call to the URL: http://srs.ebi.ac.uk/srsbin/cgi-bin/wgetz?-noSession+-view+FastaSeqs+[REFSEQP-NCBI_TaxId:1001533]

  2. RefSeq is available on the main MRS server, and may be available on other MRS servers.

For an overview of using Python with web services, see the Python section of the EMBL-EBI's "Introduction to Web Services". This includes links to the main documentation for the various tool-kits and tutorials for the most commonly used ones.


Thanks Hamish. I didn't know about the txid9606[Organism] construct. Now it works like a charm.


See also http://news.open-bio.org/news/2009/06/ncbi-einfo-biopython/ - you can discover other useful fields to filter on via einfo.

Here's an example fetching viruses in GenBank format, https://github.com/peterjc/picobio/blob/master/fetch_viruses/fetch_viruses.py

Also, beware of chimeric records, http://blastedbio.blogspot.co.uk/2013/11/entrez-trouble-with-chimeras.html
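To illustrate the einfo suggestion above, here is a minimal sketch that lists the searchable fields of the protein database. It assumes Biopython is installed; the function names are hypothetical, and the record layout (DbInfo, FieldList, Name, Description) is the one shown in the Biopython tutorial's EInfo example.

```python
def searchable_fields(einfo_record):
    """Return (short name, description) pairs from a parsed EInfo record."""
    fields = einfo_record["DbInfo"]["FieldList"]
    return [(field["Name"], field["Description"]) for field in fields]


def protein_fields(email):
    """Query EInfo for the protein database and list its search fields."""
    # Imported here so searchable_fields() can be used without Biopython.
    from Bio import Entrez

    Entrez.email = email
    handle = Entrez.einfo(db="protein")
    record = Entrez.read(handle)
    handle.close()
    return searchable_fields(record)
```

Printing the pairs returned by protein_fields('you@example.org') gives the short field names that can be used in square brackets in a query, such as [Organism].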


I don't know if this will still get any attention, but I am running into "RuntimeError: Search Backend failed:" or "no elements found". Any thoughts?

I've gotten around this by not using so many variables. It's a little less clean, but it works.


It might be best to email the NCBI Entrez team with the full details - it sounds like something on their server is failing.

10.8 years ago

Using the recently released Entrez command-line utilities, you can use:

 esearch -db protein -query "refseq[filter] AND txid9606[Organism]" | efetch -format fasta > human.refseq.sequences

Alternatively, just look at the RefSeq FTP site.

12.8 years ago

Use the property srcdb_refseq.

Search NCBI Protein for

"Homo Sapiens"[ORGN] AND srcdb_refseq[Properties]

go to

http://www.ncbi.nlm.nih.gov/protein?term=%22Homo%20Sapiens%22[ORGN]%20AND%20srcdb_refseq[Properties]

Send to/File/Fasta

You can also use NCBI ESearch/EFetch for the same query.
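For reference, the same query can be turned into E-utilities request URLs with only the Python standard library. This is a sketch under my own assumptions: the helper names are hypothetical, and the retmax default of 20 simply mirrors the ESearch default. It only builds the URLs; it does not send any request.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"


def esearch_url(term, db="protein", retmax=20):
    """Build an ESearch URL that returns the UIDs matching `term`."""
    params = {"db": db, "term": term, "retmax": retmax}
    return EUTILS_BASE + "esearch.fcgi?" + urlencode(params)


def efetch_url(uids, db="protein", rettype="fasta"):
    """Build an EFetch URL for a list of UID strings."""
    params = {"db": db, "id": ",".join(uids),
              "rettype": rettype, "retmode": "text"}
    return EUTILS_BASE + "efetch.fcgi?" + urlencode(params)


if __name__ == "__main__":
    print(esearch_url('"Homo sapiens"[ORGN] AND srcdb_refseq[Properties]'))
```

The UIDs parsed from the ESearch response would then be passed to efetch_url() to get the FASTA records.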
