Question

How Can I Programmatically Retrieve The Genbank Records With Accession Numbers In The Form Jn######?

13.1 years ago

Jason Ebaugh ▴ 60

I am trying to use NCBI's E-utilities to retrieve sequences from GenBank. All the accession numbers look like this: JN###### (example: JN556047)

I am using Ebot to generate scripts that pull the records. The E-utilities server does not accept UIDs in the form JN######. I tried dropping the "JN" prefix. That "worked" from the server's point of view, but it did not return the correct records.

When I go to the web interface for the "Nucleotide" database, it accepts JN###### UIDs no problem. However, the web-based interface will only retrieve 100 records at a time, and I have 1700 to get.

How can I retrieve the GenBank records with accession numbers in the form JN######?

• 9.6k views

updated 13.1 years ago by Yannick Wurm ★ 2.5k • written 13.1 years ago by Jason Ebaugh ▴ 60

score 8 · Answer 1 · 2012-03-11

Here is a Perl script that uses the BioPerl module "Bio::DB::GenBank". All the accession numbers must be listed in the file accnumber.txt, separated by commas or each on its own line. The file accnumber.txt must be in the same directory as the Perl script. After successful execution, the script generates a file, sequence_download.fa, containing the sequences in FASTA format. If you want to retrieve other data from GenBank, you can adapt the code by consulting the documentation for the same module, "Bio::DB::GenBank".

#!/usr/bin/perl

use strict;
use warnings;

use Bio::DB::GenBank;

my $input_file  = 'accnumber.txt';
my $output_file = 'sequence_download.fa';

open(my $in,  '<', $input_file)  or die "Cannot open $input_file: $!";
open(my $out, '>', $output_file) or die "Cannot open $output_file: $!";

# Create the database handle once and reuse it for every request.
my $db_obj = Bio::DB::GenBank->new;

while (my $line = <$in>)
{
    chomp $line;

    # Each line may hold one accession or several separated by commas.
    foreach my $acc (split /,/, $line)
    {
        $acc =~ s/\s//g;

        # Skip empty fields (blank lines, trailing commas) instead of exiting.
        next if $acc eq '';

        # Trap lookup failures so one bad accession does not stop the run.
        my $seq_obj = eval { $db_obj->get_Seq_by_acc($acc) };

        unless ($seq_obj)
        {
            warn "No record retrieved for $acc\n";
            next;
        }

        print $out ">$acc\n";
        print $out $seq_obj->seq, "\n";

        print "Sequence Downloaded:\t$acc\n";
    }
}

close $out;
close $in;
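
For anyone who prefers Ruby (used in the other answers in this thread), the input format described above, accessions separated by commas or newlines, could be parsed roughly like this. It is only a sketch: `parse_accessions` and `ACCESSION_PATTERN` are hypothetical names, and the pattern assumes the two-letter, six-digit form from the question.

```ruby
# Hypothetical helper: split raw text on commas and newlines, strip
# whitespace, drop empty fields, and remove duplicates.
def parse_accessions(text)
  text.split(/[,\n]/).map(&:strip).reject(&:empty?).uniq
end

# Assumed accession pattern: two uppercase letters followed by six digits,
# as in the question (e.g. JN556047).
ACCESSION_PATTERN = /\A[A-Z]{2}\d{6}\z/

raw = "JN556047, JN556048\n\nJN556049,\nnot-an-accession\n"
valid, invalid = parse_accessions(raw).partition { |acc| acc.match?(ACCESSION_PATTERN) }

puts "valid:   #{valid.join(' ')}"
puts "invalid: #{invalid.join(' ')}"
```

Checking the list before sending any requests makes it easier to spot typos in a file of 1700 accessions than waiting for individual lookups to fail.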

score 3 · Answer 2 · 2011-10-30

13.1 years ago

Pierre Lindenbaum 164k

You can get a list of all the accession numbers (ACNs) from ftp://ftp.ncbi.nih.gov/genbank/livelists/

You can then extract the ACN/version/gi for the JN accessions using the following command:

$ curl -s "ftp://ftp.ncbi.nih.gov/genbank/livelists/GbAccList.1023.2011.gz" |\
   gunzip -c | egrep '^JN'

Then retrieve each sequence using NCBI EFetch: http://eutils.ncbi.nlm.nih.gov/corehtml/query/static/efetch_help.html
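
As a rough sketch of the EFetch step, the accessions can be grouped into batches and turned into one request URL per batch. The helper name `efetch_urls` and the batch size of 200 are my own assumptions, not anything NCBI prescribes; the endpoint and the `db`, `id`, `rettype` and `retmode` parameters are the standard EFetch ones. Nothing is downloaded here, the code only builds the URLs.

```ruby
require 'uri'

# Base URL of the E-utilities EFetch endpoint.
EFETCH_URL = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi'

# Hypothetical helper: build one EFetch URL per batch of accessions.
# batch_size = 200 is an arbitrary choice, not an NCBI requirement.
def efetch_urls(accessions, batch_size: 200, rettype: 'gb')
  accessions.each_slice(batch_size).map do |batch|
    query = URI.encode_www_form(db: 'nuccore', id: batch.join(','),
                                rettype: rettype, retmode: 'text')
    "#{EFETCH_URL}?#{query}"
  end
end

accessions = %w[JN556047 JN556048 JN556049]
efetch_urls(accessions, batch_size: 2).each { |url| puts url }
```

Each URL can then be fetched with curl or Net::HTTP. With 1700 accessions this comes to only a handful of requests, which makes it easy to stay within NCBI's request-rate guidelines.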



Individual LiveLists records, and small sets of them, can be retrieved using EMBL-EBI's dbfetch and WSDbfetch services. For example: http://www.ebi.ac.uk/Tools/dbfetch/dbfetch?db=livelists&id=JN556047&style=raw

12.7 years ago by Hamish ★ 3.3k

score 3 · Answer 3 · 2011-10-30

The EUtils methods in BioRuby have no problem fetching JN* accessions:

#!/usr/bin/ruby
require 'rubygems'  # ruby 1.8
require 'bio'

Bio::NCBI.default_email = "me@me.com"
gb = Bio::NCBI::REST::EFetch.nucleotide("JN556047")

puts gb
# showing first few lines only
# LOCUS       JN556047                 164 bp    DNA     linear   INV 19-OCT-2011
# DEFINITION  Apis cerana isolate A0101 cGMP-dependent protein kinase foraging
#             (For) gene, exon 3 and partial cds.
# ACCESSION   JN556047
# VERSION     JN556047.1  GI:351634776
# KEYWORDS    .
# SOURCE      Apis cerana (Asiatic honeybee)
#   ORGANISM  Apis cerana
#             Eukaryota; Metazoa; Arthropoda; Hexapoda; Insecta; Pterygota;
#             Neoptera; Endopterygota; Hymenoptera; Apocrita; Aculeata; Apoidea;
#             Apidae; Apis.
# ...

Ram · Answer 4 · 2012-03-10

Here's a variant on what Neil replied, inspired by this question.

#!/usr/bin/env ruby
require 'rubygems'
require 'bio'

Bio::NCBI.default_email = "xxxxx@qmul.ac.uk"
ncbi         = Bio::NCBI::REST.new
# A String range expands to every accession between the two endpoints.
sequence_ids = ("JP773711".."JP820231").to_a
sequences    = ncbi.efetch(sequence_ids,
                           {"db"      => "nuccore",
                            "rettype" => "fasta",
                            "retmax"  => 10000000})

# NCBI returns one big string with records separated by two newlines; we want just one.
sequences.gsub!("\n\n", "\n")

File.open('Nylanderia_pubens_ests.fasta', 'w') { |f| f.write(sequences + "\n") }
puts "done."
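
Two pieces of the script above can be tried without touching the network: the String range that generates consecutive accessions, and the blank-line cleanup. In the sketch below the `response` string is a made-up stand-in for what EFetch returns, and the regular expression is a slightly more general variant of the `gsub!` above that I am assuming is acceptable, since it also collapses runs of more than two newlines.

```ruby
# A String range steps through values with String#succ, so consecutive
# accessions can be generated without any arithmetic.
ids = ('JP773711'..'JP773715').to_a
puts ids.inspect

# Stand-in for an EFetch response: FASTA records separated by a blank line.
# These sequences are made up for illustration.
response = ">JP773711\nACGT\n\n>JP773712\nGGCC\n\n"

# Collapse each run of newlines into a single newline.
fasta = response.gsub(/\n{2,}/, "\n")
puts fasta
puts "#{fasta.scan(/^>/).size} records"
```

Note that the range trick only helps when the accessions you want really are consecutive; for an arbitrary list, read the IDs from a file instead.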

score 0 · Answer 5 · 2012-03-11

12.7 years ago

Priyabrata ▴ 70

You can use the E-utilities at NCBI (http://www.ncbi.nlm.nih.gov/books/NBK25500/). They can be called from any programming language to search for and retrieve records.
