Question

Fetching Genbank Entries For List Of Accession Numbers.

12

Entering edit mode

12.1 years ago

Sanjar ▴ 150

I have a looong list of accession numbers, for which I need to fetch genbank entries. Usually, I used to use epost-efetch workflow for long lists. But now I can not, because epost doesn't accept accession numbers as IDs.

So the question is:

a) is there a way of batch downloading using accession numbers? if not b) is there a way of mapping Accession number to normal ids?

ncbi entrez • 40k views

ADD COMMENT • link updated 4.0 years ago by d.s.account ▴ 10 • written 12.1 years ago by Sanjar ▴ 150

0

Entering edit mode

can you give some example of your accession number? where is it from?

ADD REPLY • link 12.1 years ago by Leszek 4.2k

0

Entering edit mode

Ex: A22237,A22239,A32021,A32022,A33397 Those are accessions from NCBI. When you post them using epost, it gives this error: "IDs contain invalid characters which was treated as delimiters." So it appears to me that epost doesn't accept non-numeric characters for ID field. I tried to change the letters to their ascii codes, didn't help.

ADD REPLY • link 12.1 years ago by Sanjar ▴ 150

0

Entering edit mode

One more thing that I noticed today: All the BioXXX libraries just stop when they get error from epost. But I noticed that along with error, epost still returns WebEnv and query_key. But what it does is that, it takes the accession number, trims out the non-numeric characters, and searches for resultant GID. So, A22237 turns to 22237. Don't know what to do. Such a tiny problem taking up lot's of time.

ADD REPLY • link 12.1 years ago by Sanjar ▴ 150

score 8 · Answer 1 · 2016-04-24

Python 3 + BioPython with command-line interface. Basically, a less hardcoded and flexible version of Leszek's code, that doesn't fail when the list of accessions is so big, that GI retrieval results in HTTP Error 414: Request-URI Too Large

#! /usr/bin/env python3


import argparse
import sys
import os

import Bio.Entrez


RETMAX = 10**9
GB_EXT = ".gb"


def parse_args(arg_lst):
    parser = argparse.ArgumentParser()
    parser.add_argument("-i", "--input", type=str, required=True,
                        help="A file with accessions to download")
    parser.add_argument("-d", "--database", type=str, required=True,
                        help="NCBI database ID")
    parser.add_argument("-e", "--email", type=str, required=False,
                        default="some_email@somedomain.com",
                        help="An e-mail address")
    parser.add_argument("-b", "--batch", type=int, required=False, default=100,
                        help="The number of accessions to process per request")
    parser.add_argument("-o", "--output_dir", type=str, required=True,
                        help="The directory to write downloaded files to")

    return parser.parse_args(arg_lst)


def read_accessions(fp):
    with open(fp) as acc_lines:
        return [line.strip() for line in acc_lines]


def accessions_to_gb(accessions, db, batchsize, retmax):
    def batch(sequence, size):
        l = len(accessions)
        for start in range(0, l, size):
            yield sequence[start:min(start + size, l)]

    def extract_records(records_handle):
        buffer = []
        for line in records_handle:
            if line.startswith("LOCUS") and buffer:
                # yield accession number and record
                yield buffer[0].split()[1], "".join(buffer)
                buffer = [line]
            else:
                buffer.append(line)
        yield buffer[0].split()[1], "".join(buffer)

    def process_batch(accessions_batch):
        # get GI for query accessions
        query = " ".join(accessions_batch)
        query_handle = Bio.Entrez.esearch(db=db, term=query, retmax=retmax)
        gi_list = Bio.Entrez.read(query_handle)['IdList']

        # get GB files
        search_handle = Bio.Entrez.epost(db=db, id=",".join(gi_list))
        search_results = Bio.Entrez.read(search_handle)
        webenv, query_key = search_results["WebEnv"], search_results["QueryKey"]
        records_handle = Bio.Entrez.efetch(db=db, rettype="gb", retmax=batchsize,
                                           webenv=webenv, query_key=query_key)
        yield from extract_records(records_handle)

    accession_batches = batch(accessions, batchsize)
    for acc_batch in accession_batches:
        yield from process_batch(acc_batch)


def write_record(dir, accession, record):
    with open(os.path.join(dir, accession + GB_EXT), "w") as output:
        print(record, file=output)


def main(argv):
    args = parse_args(argv)
    accessions = read_accessions(os.path.abspath(args.input))
    op_dir = os.path.abspath(args.output_dir)
    if not os.path.exists(op_dir):
        os.makedirs(op_dir)
    dbase = args.database
    Bio.Entrez.email = args.email
    batchsize = args.batch

    for acc, record in accessions_to_gb(accessions, dbase, batchsize, RETMAX):
        write_record(op_dir, acc, record)


if __name__ == "__main__":
    main(sys.argv[1:])

score 7 · Answer 2 · 2016-05-31

7

Entering edit mode

8.9 years ago

ifreecell ▴ 220

I have successfully used wget to download more than 1000 plastid Genbank files using Accessions (with version, NC_018523.1). It's not fast, but works.

mkdir -p gb fa
cut -f 1 plastid.accs | while read -r acc
do
    wget -O gb/"${acc}.gb" \
       "http://www.ncbi.nlm.nih.gov/sviewer/viewer.fcgi?db=nuccore&dopt=gb&sendto=on&id=$acc"
#    sleep 1s
#    wget -O fa/"${acc}.fa" \
#       "http://www.ncbi.nlm.nih.gov/sviewer/viewer.fcgi?db=nuccore&dopt=fasta&sendto=on&id=$acc"
done

To map accession to gi, you could try

wget ftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/nucl_gb.accession2taxid.gz # ~800MB
gunzip nucl_gb.accession2taxid.gz
grep -f plastid.accs nucl_gb.accession2taxid | cut -f  1, 4 > Acc_GI.csv

ADD COMMENT • link 8.9 years ago by ifreecell ▴ 220

0

Entering edit mode

I have the same question . I try your method. It works.But If I want to download the full genbank file . what should I do? change the " dopt=gb"? what type should I use?

ADD REPLY • link 8.7 years ago by mark900830 • 0

0

Entering edit mode

Replacing dopt=gb with dopt=gbwithparts should work, but I have not try yet.

ADD REPLY • link 8.6 years ago by ifreecell ▴ 220

0

Entering edit mode

It's not fast, but works.

ADD REPLY • link 6.8 years ago by Prakki Rama ★ 2.7k

score 5 · Answer 3 · 2013-03-22

5

Entering edit mode

12.1 years ago

Leszek 4.2k

You can try this:

#!/usr/bin/env python
"""Fetch GenBank entries for given accessions. 

USAGE:
  python acc2gb.py A22237 A22239 A32021 A32022 A33397 > out.gb
or
  cat ids | python acc2gb.py > out.gb

DEPENDENCIES:
Biopython
"""

import sys
from Bio import Entrez

#define email for entrez login
db           = "nuccore"
Entrez.email = "some_email@somedomain.com"

#load accessions from arguments
if len(sys.argv[1:]) > 1:
  accs = sys.argv[1:]
else: #load accesions from stdin  
  accs = [ l.strip() for l in sys.stdin if l.strip() ]
#fetch
sys.stderr.write( "Fetching %s entries from GenBank: %s\n" % (len(accs), ", ".join(accs[:10])))
for i,acc in enumerate(accs):
  try:
    sys.stderr.write( " %9i %s          \r" % (i+1,acc))  
    handle = Entrez.efetch(db=db, rettype="gb", id=acc)
    #print output to stdout
    sys.stdout.write(handle.read())
  except:
    sys.stderr.write( "Error! Cannot fetch: %s        \n" % acc)

EDIT
The same using epost:

import sys
from Bio import Entrez

#define email for entrez login
db           = "nuccore"
Entrez.email = "some_email@somedomain.com"
batchSize    = 100
retmax       = 10**9

#load accessions from arguments
if len(sys.argv[1:]) > 1:
  accs = sys.argv[1:]
else: #load accesions from stdin  
  accs = [ l.strip() for l in sys.stdin if l.strip() ]
#first get GI for query accesions
sys.stderr.write( "Fetching %s entries from GenBank: %s\n" % (len(accs), ", ".join(accs[:10])))
query  = " ".join(accs)
handle = Entrez.esearch( db=db,term=query,retmax=retmax )
giList = Entrez.read(handle)['IdList']
sys.stderr.write( "Found %s GI: %s\n" % (len(giList), ", ".join(giList[:10])))
#post NCBI query
search_handle     = Entrez.epost(db=db, id=",".join(giList))
search_results    = Entrez.read(search_handle)
webenv,query_key  = search_results["WebEnv"], search_results["QueryKey"] 
#fecth all results in batch of batchSize entries at once
for start in range( 0,len(giList),batchSize ):
  sys.stderr.write( " %9i" % (start+1,))
  #fetch entries in batch
  handle = Entrez.efetch(db=db, rettype="gb", retstart=start, retmax=batchSize, webenv=webenv, query_key=query_key)
  #print output to stdout
  sys.stdout.write(handle.read())

ADD COMMENT • link 12.1 years ago by Leszek 4.2k

0

Entering edit mode

I usually use this way. But it's extremely slow when you want to download thousands of entries and sometimes falls to timeout. Using epost would be much better. But, apparently there's no way of doing that.

ADD REPLY • link 12.1 years ago by Sanjar ▴ 150

0

Entering edit mode

you can do epost easily with biopython - you just have to be sure to provide valid accessions, otherwise the script will crash... have a look here how to do it: http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc114

ADD REPLY • link 12.1 years ago by Leszek 4.2k

0

Entering edit mode

I'm having the same issue with epost not taking accession numbers but integer UIDs instead...

ADD REPLY • link 11.9 years ago by junyinglim • 0

0

Entering edit mode

Look at edited option: you can provide any id that is accepted by NCBI and the script will get UIDs of these automatically and print fasta...

ADD REPLY • link 11.9 years ago by Leszek 4.2k

0

Entering edit mode

This works fantastic! Just couldn't think myself of searching by accessions and getting ['IdList'] back. Thanks Leszek.

ADD REPLY • link 11.8 years ago by Sanjar ▴ 150

1

Entering edit mode

you can find more tricks in biopython tutorial: http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc110

ADD REPLY • link 11.8 years ago by Leszek 4.2k

0

Entering edit mode

@ Leszec I want to retrieve gene symbols using RefSeq protein(stored in text file) how can i do this using the program when i run the script saving it as .py its showing "SYNTAX ERROR" I use Python2.7 Help me to work on the script Thanks in advance

ADD REPLY • link 8.0 years ago by lakshmi.bioinformatics ▴ 30

0

Entering edit mode

Thank you for this solution. But its now 8 years later, is there really no better way to do this?

ADD REPLY • link 4.0 years ago by d.s.account ▴ 10

score 2 · Answer 4 · 2013-03-22

2

Entering edit mode

12.1 years ago

Daniel ★ 4.0k

The hit-it-on-the-head-with-a-hammer alternative is to batch download a whole section of genbank, then filter out the ones you want using bioperl which will take accessions.

i.e. something along the lines of

    foreach my $acc (@acc){
            if ($seq->accession eq $acc){
                       $fileout->write_seq($seq);           ##etc. etc. etc.
            }

ADD COMMENT • link 12.1 years ago by Daniel ★ 4.0k

0

Entering edit mode

Any suggestions on how to download the whole data of genbank?

ADD REPLY • link 12.1 years ago by Sanjar ▴ 150

0

Entering edit mode

There is the ftp if you want to batch DL ftp://ftp.ncbi.nlm.nih.gov/genbank/ But if you're only concerned about a certain taxa (prok, euk, human, whatever) you can just browser download it via http://www.ncbi.nlm.nih.gov/nuccore

ADD REPLY • link 12.1 years ago by Daniel ★ 4.0k

score 0 · Answer 5 · 2013-03-20

0

Entering edit mode

12.1 years ago

secretjess ▴ 210

I'm not too sure which accession number you mean and what normal ID you are referring to but http://david.abcc.ncifcrf.gov/conversion.jsp has the ability to map between various accessions (e.g. genbank, uniprot) and IDs (e.g. entrez, ensembl)

ADD COMMENT • link 12.1 years ago by secretjess ▴ 210

0

Entering edit mode

I'm working with GenBank. So, I need to map Genbank accession numbers to Genbank GI number. DAVID can't find the majority of the accessions that I submit. Moreover, it doesn't provide API, afaik.

ADD REPLY • link 12.1 years ago by Sanjar ▴ 150

0

Entering edit mode

Ah, sorry that wasn't helpful. This looks like a similar question to yours (in reverse) so maybe the solution will be more useful: Map Genbank Gi To Accession Numbers (A Million Times)

ADD REPLY • link 12.1 years ago by secretjess ▴ 210

1

Entering edit mode

I had a look on that thread. The problem is not solved there. If I had GIDs first, and wanted to convert them to Accessions, then that would be viable via epost-efetch:). But can't do the other way around. Thank you for your suggestions.

ADD REPLY • link 12.1 years ago by Sanjar ▴ 150

score 0 · Answer 6 · 2013-03-21

0

Entering edit mode

12.1 years ago

Wen.Huang ★ 1.2k

two not so specific cents:

1) NCBI has a well documented but under-used collection of E-utilities you may want to try

2) Bio-perl has modules to fetch GenBank entries using different IDs.

ADD COMMENT • link 12.1 years ago by Wen.Huang ★ 1.2k

0

Entering edit mode

1) Yes, I like very much working with E-utils. It's the first time I got stuck with a problem. 2) True. There's no problem in fetching by accession number in one-request-at-a-time manner. But it's slow, due to 1/3 seconds limitation of NCBI. The problem comes when you want to try epost for batch downloading. Thank you for your suggestions!

ADD REPLY • link 12.1 years ago by Sanjar ▴ 150

score 0 · Answer 7 · 2013-03-22

Depends on how long "long" is, but up to some length, you can even use the Entrez web interface to get these e.g. querying with "A22237 A22239 A32021 A32022 A33397" here http://www.ncbi.nlm.nih.gov gives you a link to this page:

http://www.ncbi.nlm.nih.gov/nuccore/A22237A22239A32021A32022A33397

and you can get these in a range of formats e.g. full genbank, fasta, etc., using the "Send to" functionality

score 0 · Answer 8 · 2013-07-03

0

Entering edit mode

11.8 years ago

Sanjar ▴ 150

The script provided by Leszek solves the problem.

ADD COMMENT • link 11.8 years ago by Sanjar ▴ 150