Question

Refseq Proteins For A Given Taxid

5

13.4 years ago

Chris ★ 1.6k

Hi,

I've got the following problem:

Given an NCBI taxid, I'd like to bulk-download all RefSeq protein sequences for that species. The FTP server seems to provide FASTA files for select species such as human. However, the majority seem to be concatenated in huge FASTA files organised by vertebrate, invertebrate, ... Of course I could download all of those and parse them for taxid, but given the vast size this seems infeasible to me.

Now, at least I'm able to do that using Entrez via the NCBI homepage. However, I need to do this programmatically, preferably using Python/BioPython.

I've already found a way to retrieve _single_ sequences using a RefSeq accession, but it is very slow when iterating over 1000s of accessions:

from Bio import Entrez
#acc: some RefSeq accession, ver: its version
rec = Entrez.read(Entrez.esearch(db="protein", term="%s.%s"%(acc,ver) ))
fasta = Entrez.efetch(db="protein", id=rec["IdList"][0], rettype="fasta").read()

Is there a similar way to bulk-retrieve all sequences for a given taxid?

Thanks,
Chris

protein biopython refseq • 12k views

Updated 2.4 years ago by Ram 45k • written 13.4 years ago by Chris ★ 1.6k

Ram · Answer 1 · 2012-02-24

Since you want to automate this, you'll have to use the NCBI E-utilities (the NCBI will blacklist users who script against the Entrez web interface). Fortunately, since you are using BioPython's Bio.Entrez (see the "Biopython Tutorial and Cookbook", Chapter 8, "Accessing NCBI’s Entrez databases"), this is already taken care of.

To bulk fetch entries whose UIDs (GI numbers in this case) you don't know, you first have to get the UIDs. For this you use ESearch, specifying the 'protein' database and a query to find the required entries. For example, to find the Homo sapiens (NCBI TaxId=9606) entries from RefSeq the query is:

refseq[filter] AND txid9606[Organism]

This gives a result structure which contains the UIDs. EFetch takes a comma-separated list of UIDs, so extract the UIDs, construct the list, and then feed this to EFetch, specifying the required format to get the data.

The following example Python script uses BioPython to fetch the proteins from Bovine papillomavirus 7 (NCBI TaxId=1001533) present in RefSeq in fasta sequence format:

from Bio import Entrez

entrezDbName = 'protein'
ncbiTaxId = '1001533' # Bovine papillomavirus 7
Entrez.email = 'email@example.org'

# Find entries matching the query
entrezQuery = "refseq[filter] AND txid%s[Organism]" % ncbiTaxId
searchResultHandle = Entrez.esearch(db=entrezDbName, term=entrezQuery)
searchResult = Entrez.read(searchResultHandle)
searchResultHandle.close()

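# Get the data: join the UIDs into a comma-separated list and fetch as FASTA text.
# Note: ESearch returns at most 20 UIDs unless retmax is set.
uidList = ','.join(searchResult['IdList'])
fetchHandle = Entrez.efetch(db=entrezDbName, id=uidList, rettype='fasta', retmode='text')
entryData = fetchHandle.read()
fetchHandle.close()
print(entryData)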

In this case the result is small, only 7 proteins, so fetching everything in a single step is reasonable. For taxa with larger numbers of entries, you will want to retrieve the entry data in chunks rather than in one go, to avoid time-outs, to limit the load on the NCBI's servers, and to allow for checkpoints and retries in your own code. See "8.15 Using the history and WebEnv" for details of how to use the history capabilities of E-utilities from BioPython to simplify this process.
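As a rough illustration of the chunked approach, here is a minimal sketch, assuming the usehistory/WebEnv pattern described in the Biopython tutorial. The helper names (build_refseq_query, batch_offsets, fetch_refseq_fasta) and the batch size of 500 are made up for this example, and the Bio.Entrez module is passed in as an argument so the loop can be tried with a stand-in object:

```python
def build_refseq_query(taxid):
    """Return the Entrez query for RefSeq proteins of one taxon."""
    return "refseq[filter] AND txid%s[Organism]" % taxid


def batch_offsets(total, batch_size):
    """Yield the retstart offsets needed to cover `total` records."""
    for start in range(0, total, batch_size):
        yield start


def fetch_refseq_fasta(entrez, taxid, out_handle, batch_size=500):
    """Write all RefSeq proteins for `taxid` to `out_handle` as FASTA.

    `entrez` is expected to be the Bio.Entrez module (hypothetical
    parameter, introduced here so the function has no hard import).
    Returns the number of records reported by ESearch.
    """
    # Search once and keep the result set on the NCBI history server.
    handle = entrez.esearch(db="protein", term=build_refseq_query(taxid),
                            usehistory="y")
    result = entrez.read(handle)
    handle.close()

    total = int(result["Count"])
    # Fetch the stored result set in chunks instead of one large request.
    for start in batch_offsets(total, batch_size):
        fetch = entrez.efetch(db="protein", rettype="fasta", retmode="text",
                              retstart=start, retmax=batch_size,
                              webenv=result["WebEnv"],
                              query_key=result["QueryKey"])
        out_handle.write(fetch.read())
        fetch.close()
    return total
```

To use it, import Entrez from Bio, set Entrez.email, open an output file, and call fetch_refseq_fasta(Entrez, '1001533', out). Retries and checkpoints are left out to keep the sketch short; the per-chunk loop is where they would go.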

Alternatively there are many other resources which provide the RefSeq data, and provide combined query and fetch capabilities. For example:

RefSeq is available from various public SRS servers (see Public SRS Installations). The EMBL-EBI's Linking to SRS guide documents how to use SRS via URLs, including details of using URLs as an API to SRS. Using SRS@EBI, the example above could be replaced with a call to the URL: http://srs.ebi.ac.uk/srsbin/cgi-bin/wgetz?-noSession+-view+FastaSeqs+[REFSEQP-NCBI_TaxId:1001533]
RefSeq is available on the main MRS server, and may be available on other MRS servers.

For an overview of using Python with web services, see the Python section of the EMBL-EBI's "Introduction to Web Services". This includes links to the main documentation for the various tool-kits and to tutorials for the most commonly used ones.

score 4 · Answer 2 · 2014-03-04

11.4 years ago

Giovanni M Dall'Olio 28k

Using the recently released Entrez command-line utilities, you can use:

 esearch -db protein -query "refseq[filter] AND txid9606[Organism]" | efetch -format fasta > human.refseq.sequences

Alternatively, just look at the RefSeq FTP site.


score 2 · Answer 3 · 2012-02-23

13.4 years ago

Pierre Lindenbaum 166k

Use the property srcdb_refseq

Search NCBI Protein for

"Homo Sapiens"[ORGN] AND srcdb_refseq[Properties]

go to

http://www.ncbi.nlm.nih.gov/protein?term=%22Homo%20Sapiens%22[ORGN]%20AND%20srcdb_refseq[Properties]

Send to/File/Fasta

You can also use NCBI ESearch/EFetch for the same query.
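For the ESearch route, a minimal sketch of how the same query could be turned into an E-utilities request URL is shown below. The function name esearch_url and the retmax default are assumptions for illustration; the sketch only builds the URL and does not contact NCBI:

```python
from urllib.parse import urlencode

# Base address of the public NCBI E-utilities ESearch endpoint.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def esearch_url(organism, retmax=20):
    """Build an ESearch URL for RefSeq proteins of the named organism."""
    # Same query as above, with the organism name filled in.
    term = '"%s"[ORGN] AND srcdb_refseq[Properties]' % organism
    params = {"db": "protein", "term": term, "retmax": retmax}
    # urlencode percent-escapes the quotes and brackets in the term.
    return ESEARCH_URL + "?" + urlencode(params)
```

Calling esearch_url("Homo sapiens") gives a URL whose XML response lists the matching UIDs, which can then be passed to EFetch.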
