Fetching Genbank Entries For List Of Accession Numbers.
8
12
Entering edit mode
11.7 years ago
Sanjar ▴ 140

I have a looong list of accession numbers, for which I need to fetch genbank entries. Usually, I used to use epost-efetch workflow for long lists. But now I can not, because epost doesn't accept accession numbers as IDs.

So the question is:

a) is there a way of batch downloading using accession numbers? if not b) is there a way of mapping Accession number to normal ids?

ncbi entrez • 39k views
ADD COMMENT
0
Entering edit mode

can you give some example of your accession number? where is it from?

ADD REPLY
0
Entering edit mode

Ex: A22237,A22239,A32021,A32022,A33397 Those are accessions from NCBI. When you post them using epost, it gives this error: "IDs contain invalid characters which was treated as delimiters." So it appears to me that epost doesn't accept non-numeric characters for ID field. I tried to change the letters to their ascii codes, didn't help.

ADD REPLY
0
Entering edit mode

One more thing that I noticed today: All the BioXXX libraries just stop when they get error from epost. But I noticed that along with error, epost still returns WebEnv and query_key. But what it does is that, it takes the accession number, trims out the non-numeric characters, and searches for resultant GID. So, A22237 turns to 22237. Don't know what to do. Such a tiny problem taking up lot's of time.

ADD REPLY
8
Entering edit mode
8.6 years ago
Eli Korvigo ▴ 230

Python 3 + BioPython with command-line interface. Basically, a less hardcoded and flexible version of Leszek's code, that doesn't fail when the list of accessions is so big, that GI retrieval results in HTTP Error 414: Request-URI Too Large

#! /usr/bin/env python3


import argparse
import sys
import os

import Bio.Entrez


RETMAX = 10**9
GB_EXT = ".gb"


def parse_args(arg_lst):
    parser = argparse.ArgumentParser()
    parser.add_argument("-i", "--input", type=str, required=True,
                        help="A file with accessions to download")
    parser.add_argument("-d", "--database", type=str, required=True,
                        help="NCBI database ID")
    parser.add_argument("-e", "--email", type=str, required=False,
                        default="some_email@somedomain.com",
                        help="An e-mail address")
    parser.add_argument("-b", "--batch", type=int, required=False, default=100,
                        help="The number of accessions to process per request")
    parser.add_argument("-o", "--output_dir", type=str, required=True,
                        help="The directory to write downloaded files to")

    return parser.parse_args(arg_lst)


def read_accessions(fp):
    with open(fp) as acc_lines:
        return [line.strip() for line in acc_lines]


def accessions_to_gb(accessions, db, batchsize, retmax):
    def batch(sequence, size):
        l = len(accessions)
        for start in range(0, l, size):
            yield sequence[start:min(start + size, l)]

    def extract_records(records_handle):
        buffer = []
        for line in records_handle:
            if line.startswith("LOCUS") and buffer:
                # yield accession number and record
                yield buffer[0].split()[1], "".join(buffer)
                buffer = [line]
            else:
                buffer.append(line)
        yield buffer[0].split()[1], "".join(buffer)

    def process_batch(accessions_batch):
        # get GI for query accessions
        query = " ".join(accessions_batch)
        query_handle = Bio.Entrez.esearch(db=db, term=query, retmax=retmax)
        gi_list = Bio.Entrez.read(query_handle)['IdList']

        # get GB files
        search_handle = Bio.Entrez.epost(db=db, id=",".join(gi_list))
        search_results = Bio.Entrez.read(search_handle)
        webenv, query_key = search_results["WebEnv"], search_results["QueryKey"]
        records_handle = Bio.Entrez.efetch(db=db, rettype="gb", retmax=batchsize,
                                           webenv=webenv, query_key=query_key)
        yield from extract_records(records_handle)

    accession_batches = batch(accessions, batchsize)
    for acc_batch in accession_batches:
        yield from process_batch(acc_batch)


def write_record(dir, accession, record):
    with open(os.path.join(dir, accession + GB_EXT), "w") as output:
        print(record, file=output)


def main(argv):
    args = parse_args(argv)
    accessions = read_accessions(os.path.abspath(args.input))
    op_dir = os.path.abspath(args.output_dir)
    if not os.path.exists(op_dir):
        os.makedirs(op_dir)
    dbase = args.database
    Bio.Entrez.email = args.email
    batchsize = args.batch

    for acc, record in accessions_to_gb(accessions, dbase, batchsize, RETMAX):
        write_record(op_dir, acc, record)


if __name__ == "__main__":
    main(sys.argv[1:])
ADD COMMENT
0
Entering edit mode

If I want to download the full genbank file of my query id , I found this scripty can't work.and then ,I set the rettype to genbank , It doesn't work . So ,what should I do ? I want to download thousands of genome files

ADD REPLY
0
Entering edit mode

Do you mean that the file has no sequence in it? There are some entries in GenBank that do not provide sequences, only annotation and metadata. If it's not the case, then I'll gladly help you and update the code. Waiting for your feedback.

ADD REPLY
0
Entering edit mode

your code is pretty cool . I mean I want to download complete genome file . I just don't know how to select the rettype . I have solved the issue . If want to download complete genome file (genbank full), the rettype should be gbwithparts(rettype="gbwithparts") ,which means genbank file with protein sequences. Thank you so much ,and thanks for your nice code.

ADD REPLY
0
Entering edit mode

Hi, probably little off topic but I wasn't able to find anywhere how to pass seq_start and seq_stop optional arguments to list of queries. What I mean is, when, like in this script, you search batch of accs, obtain web-env key to easy e-fetch what you want in one step. Now e-fetch accepts seq_start and seq_stop controlling what part of sequence you get, but how to pass this information to e-fetch for multiple sequences is a mystery to me. (From what I've tried it returns only first entry correctly omitting others or it omits seq_start and seq_stop and return those entries.

Any suggestion would be appreciated. Thanks

ADD REPLY
0
Entering edit mode

Hi , Would it be possible to get the fasta sequence given a list of accessions instead of the genbank files ??

ADD REPLY
0
Entering edit mode

Yes, you need to change the rettype argument from gb to fasta

ADD REPLY
0
Entering edit mode

Hi, I tried this script but it shows syntax error as below:

File "accfetchfinal.py", line 66
    yield from extract_records(records_handle)
             ^
SyntaxError: invalid syntax

Could you help in this pls..?

ADD REPLY
1
Entering edit mode

Hello, sorry for a late reply. Your Python seems to be too old. You should update to Python >= 3.3.2

ADD REPLY
7
Entering edit mode
8.5 years ago
ifreecell ▴ 220

I have successfully used wget to download more than 1000 plastid Genbank files using Accessions (with version, NC_018523.1). It's not fast, but works.

mkdir -p gb fa
cut -f 1 plastid.accs | while read -r acc
do
    wget -O gb/"${acc}.gb" \
       "http://www.ncbi.nlm.nih.gov/sviewer/viewer.fcgi?db=nuccore&dopt=gb&sendto=on&id=$acc"
#    sleep 1s
#    wget -O fa/"${acc}.fa" \
#       "http://www.ncbi.nlm.nih.gov/sviewer/viewer.fcgi?db=nuccore&dopt=fasta&sendto=on&id=$acc"
done

To map accession to gi, you could try

wget ftp://ftp.ncbi.nih.gov/pub/taxonomy/accession2taxid/nucl_gb.accession2taxid.gz # ~800MB
gunzip nucl_gb.accession2taxid.gz
grep -f plastid.accs nucl_gb.accession2taxid | cut -f  1, 4 > Acc_GI.csv
ADD COMMENT
0
Entering edit mode

I have the same question . I try your method. It works.But If I want to download the full genbank file . what should I do? change the " dopt=gb"? what type should I use?

ADD REPLY
0
Entering edit mode

Replacing dopt=gb with dopt=gbwithparts should work, but I have not try yet.

ADD REPLY
0
Entering edit mode

It's not fast, but works.

ADD REPLY
5
Entering edit mode
11.7 years ago
Leszek 4.2k

You can try this:

#!/usr/bin/env python
"""Fetch GenBank entries for given accessions. 

USAGE:
  python acc2gb.py A22237 A22239 A32021 A32022 A33397 > out.gb
or
  cat ids | python acc2gb.py > out.gb

DEPENDENCIES:
Biopython
"""

import sys
from Bio import Entrez

#define email for entrez login
db           = "nuccore"
Entrez.email = "some_email@somedomain.com"

#load accessions from arguments
if len(sys.argv[1:]) > 1:
  accs = sys.argv[1:]
else: #load accesions from stdin  
  accs = [ l.strip() for l in sys.stdin if l.strip() ]
#fetch
sys.stderr.write( "Fetching %s entries from GenBank: %s\n" % (len(accs), ", ".join(accs[:10])))
for i,acc in enumerate(accs):
  try:
    sys.stderr.write( " %9i %s          \r" % (i+1,acc))  
    handle = Entrez.efetch(db=db, rettype="gb", id=acc)
    #print output to stdout
    sys.stdout.write(handle.read())
  except:
    sys.stderr.write( "Error! Cannot fetch: %s        \n" % acc)  

EDIT
The same using epost:

import sys
from Bio import Entrez

#define email for entrez login
db           = "nuccore"
Entrez.email = "some_email@somedomain.com"
batchSize    = 100
retmax       = 10**9

#load accessions from arguments
if len(sys.argv[1:]) > 1:
  accs = sys.argv[1:]
else: #load accesions from stdin  
  accs = [ l.strip() for l in sys.stdin if l.strip() ]
#first get GI for query accesions
sys.stderr.write( "Fetching %s entries from GenBank: %s\n" % (len(accs), ", ".join(accs[:10])))
query  = " ".join(accs)
handle = Entrez.esearch( db=db,term=query,retmax=retmax )
giList = Entrez.read(handle)['IdList']
sys.stderr.write( "Found %s GI: %s\n" % (len(giList), ", ".join(giList[:10])))
#post NCBI query
search_handle     = Entrez.epost(db=db, id=",".join(giList))
search_results    = Entrez.read(search_handle)
webenv,query_key  = search_results["WebEnv"], search_results["QueryKey"] 
#fecth all results in batch of batchSize entries at once
for start in range( 0,len(giList),batchSize ):
  sys.stderr.write( " %9i" % (start+1,))
  #fetch entries in batch
  handle = Entrez.efetch(db=db, rettype="gb", retstart=start, retmax=batchSize, webenv=webenv, query_key=query_key)
  #print output to stdout
  sys.stdout.write(handle.read())

ADD COMMENT
0
Entering edit mode

I usually use this way. But it's extremely slow when you want to download thousands of entries and sometimes falls to timeout. Using epost would be much better. But, apparently there's no way of doing that.

ADD REPLY
0
Entering edit mode

you can do epost easily with biopython - you just have to be sure to provide valid accessions, otherwise the script will crash... have a look here how to do it: http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc114

ADD REPLY
0
Entering edit mode

I'm having the same issue with epost not taking accession numbers but integer UIDs instead...

ADD REPLY
0
Entering edit mode

Look at edited option: you can provide any id that is accepted by NCBI and the script will get UIDs of these automatically and print fasta...

ADD REPLY
0
Entering edit mode

This works fantastic! Just couldn't think myself of searching by accessions and getting ['IdList'] back. Thanks Leszek.

ADD REPLY
1
Entering edit mode

you can find more tricks in biopython tutorial: http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc110

ADD REPLY
0
Entering edit mode

@ Leszec I want to retrieve gene symbols using RefSeq protein(stored in text file) how can i do this using the program when i run the script saving it as .py its showing "SYNTAX ERROR" I use Python2.7 Help me to work on the script Thanks in advance

ADD REPLY
0
Entering edit mode

Thank you for this solution. But its now 8 years later, is there really no better way to do this?

ADD REPLY
2
Entering edit mode
11.7 years ago
Daniel ★ 4.0k

The hit-it-on-the-head-with-a-hammer alternative is to batch download a whole section of genbank, then filter out the ones you want using bioperl which will take accessions.

i.e. something along the lines of

    foreach my $acc (@acc){
            if ($seq->accession eq $acc){
                       $fileout->write_seq($seq);           ##etc. etc. etc.
            }
ADD COMMENT
0
Entering edit mode

Any suggestions on how to download the whole data of genbank?

ADD REPLY
0
Entering edit mode

There is the ftp if you want to batch DL ftp://ftp.ncbi.nlm.nih.gov/genbank/ But if you're only concerned about a certain taxa (prok, euk, human, whatever) you can just browser download it via http://www.ncbi.nlm.nih.gov/nuccore

ADD REPLY
0
Entering edit mode
11.7 years ago
secretjess ▴ 210

I'm not too sure which accession number you mean and what normal ID you are referring to but http://david.abcc.ncifcrf.gov/conversion.jsp has the ability to map between various accessions (e.g. genbank, uniprot) and IDs (e.g. entrez, ensembl)

ADD COMMENT
0
Entering edit mode

I'm working with GenBank. So, I need to map Genbank accession numbers to Genbank GI number. DAVID can't find the majority of the accessions that I submit. Moreover, it doesn't provide API, afaik.

ADD REPLY
0
Entering edit mode

Ah, sorry that wasn't helpful. This looks like a similar question to yours (in reverse) so maybe the solution will be more useful: Map Genbank Gi To Accession Numbers (A Million Times)

ADD REPLY
1
Entering edit mode

I had a look on that thread. The problem is not solved there. If I had GIDs first, and wanted to convert them to Accessions, then that would be viable via epost-efetch:). But can't do the other way around. Thank you for your suggestions.

ADD REPLY
0
Entering edit mode
11.7 years ago
Wen.Huang ★ 1.2k

two not so specific cents:

1) NCBI has a well documented but under-used collection of E-utilities you may want to try

2) Bio-perl has modules to fetch GenBank entries using different IDs.

ADD COMMENT
0
Entering edit mode

1) Yes, I like very much working with E-utils. It's the first time I got stuck with a problem. 2) True. There's no problem in fetching by accession number in one-request-at-a-time manner. But it's slow, due to 1/3 seconds limitation of NCBI. The problem comes when you want to try epost for batch downloading. Thank you for your suggestions!

ADD REPLY
0
Entering edit mode
11.7 years ago

Depends on how long "long" is, but up to some length, you can even use the Entrez web interface to get these e.g. querying with "A22237 A22239 A32021 A32022 A33397" here http://www.ncbi.nlm.nih.gov gives you a link to this page:

http://www.ncbi.nlm.nih.gov/nuccore/A22237A22239A32021A32022A33397

and you can get these in a range of formats e.g. full genbank, fasta, etc., using the "Send to" functionality

ADD COMMENT
0
Entering edit mode
11.4 years ago
Sanjar ▴ 140

The script provided by Leszek solves the problem.

ADD COMMENT

Login before adding your answer.

Traffic: 1968 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6