How Can I Programmatically Retrieve The Genbank Records With Accession Numbers In The Form Jn######?
5
2
13.1 years ago
Jason Ebaugh ▴ 60

I am trying to use NCBI's E-utilities to retrieve sequences from GenBank. All the accession numbers look like this: JN###### (example: JN556047).

I am using Ebot to generate scripts to pull the records. The E-utilities server does not like the UIDs in the form of JN######. I tried dropping the "JN". It "worked" from the server's point of view, but it did not give me the correct records.

When I go to the web interface for the "Nucleotide" database, it accepts JN###### UIDs no problem. However, the web-based interface will only retrieve 100 records at a time, and I have 1700 to get.

How can I retrieve the GenBank records with accession numbers in the form JN######?

8
12.7 years ago
raunakms ★ 1.1k

Here is a Perl script that uses the BioPerl module "Bio::DB::GenBank". All the accession numbers must be listed in the file accnumber.txt, separated by commas or placed on separate lines. The file accnumber.txt must also be in the same directory as the Perl script. After successful execution, the script generates a file sequence_download.fa containing the sequences in FASTA format. If you want to retrieve other data from the GenBank database, you can tweak the code by looking up the same module, "Bio::DB::GenBank".

#!/usr/bin/perl

use strict;
use warnings;

use Bio::DB::GenBank;

my $input_file  = 'accnumber.txt';
my $output_file = 'sequence_download.fa';

# Three-argument open with lexical handles; stop early if a file is unavailable.
open(my $in,  '<', $input_file)  or die "Cannot open $input_file: $!";
open(my $out, '>', $output_file) or die "Cannot open $output_file: $!";

# One database handle is enough for all requests.
my $db_obj = Bio::DB::GenBank->new;

while (my $line = <$in>)
{
    chomp $line;

    # Accessions may be comma-separated, one per line, or both.
    foreach my $acc (split /,/, $line)
    {
        $acc =~ s/\s//g;

        # Skip empty entries instead of exiting the whole script.
        next if $acc eq '';

        # A failed lookup may throw, so trap it and move on.
        my $seq_obj = eval { $db_obj->get_Seq_by_acc($acc) };
        unless ($seq_obj)
        {
            warn "No record retrieved for $acc\n";
            next;
        }

        print $out ">$acc\n", $seq_obj->seq, "\n";

        print "Sequence Downloaded:\t$acc\n";
    }
}

close $out;
close $in;
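
For readers following the Ruby answers in this thread, the input format described above (accessions separated by commas, newlines, or both) could be parsed roughly like this. The method name parse_accessions is hypothetical and not part of the original script; dropping duplicates is an extra assumption about what is wanted.

```ruby
# Parse accession numbers separated by commas and/or newlines.
# parse_accessions is a hypothetical helper name, not from the answer above.
def parse_accessions(text)
  text.split(/[,\n]/)                          # break on commas and line ends
      .map { |entry| entry.gsub(/\s+/, '') }   # remove stray whitespace
      .reject(&:empty?)                        # skip blank entries
      .uniq                                    # assumption: duplicates are unwanted
end
```

Typical use would be accessions = parse_accessions(File.read('accnumber.txt')).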
3
13.1 years ago

You can get a list of all the ACNs in ftp://ftp.ncbi.nih.gov/genbank/livelists/

You can then extract the ACN/version/gi entries for the JN accessions using the following command:

$ curl -s "ftp://ftp.ncbi.nih.gov/genbank/livelists/GbAccList.1023.2011.gz" |\
   gunzip -c | egrep '^JN'

Then retrieve each sequence using NCBI EFetch: http://eutils.ncbi.nlm.nih.gov/corehtml/query/static/efetch_help.html
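
A minimal sketch of the EFetch step in Ruby, using only the standard library. It assumes the current E-utilities endpoint (efetch.fcgi under https://eutils.ncbi.nlm.nih.gov/entrez/eutils/) and that accession numbers are accepted directly in the id parameter for db=nuccore. The helper names efetch_uri and fetch_in_batches, the batch size of 200, and the 0.5 second pause are illustrative choices, not part of the answer above.

```ruby
require 'uri'
require 'net/http'

EFETCH_URL = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi'

# Build one EFetch URI for a batch of accessions (hypothetical helper).
# rettype 'gb' asks for GenBank flat files; 'fasta' would give FASTA.
def efetch_uri(accessions, rettype: 'gb')
  uri = URI(EFETCH_URL)
  uri.query = URI.encode_www_form(
    db: 'nuccore',
    id: accessions.join(','),
    rettype: rettype,
    retmode: 'text'
  )
  uri
end

# Fetch the accessions in batches and yield each response body.
# batch_size and pause are assumed values chosen to keep requests modest.
def fetch_in_batches(accessions, batch_size: 200, pause: 0.5)
  accessions.each_slice(batch_size) do |batch|
    response = Net::HTTP.get_response(efetch_uri(batch))
    raise "EFetch failed: #{response.code}" unless response.is_a?(Net::HTTPSuccess)

    yield response.body
    sleep pause
  end
end
```

With 1700 accessions this would issue nine requests; each yielded body could be appended to an output file, for example File.open('records.gb', 'a') { |f| f.write(body) } inside the block.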

0

Individual LiveLists records, and small sets of them, can be retrieved using EMBL-EBI's dbfetch and WSDbfetch services. For example: http://www.ebi.ac.uk/Tools/dbfetch/dbfetch?db=livelists&id=JN556047&style=raw

3
13.1 years ago
Neilfws 49k

The EUtils methods in BioRuby have no problem fetching JN* accessions:

#!/usr/bin/ruby
require 'rubygems'  # ruby 1.8
require 'bio'

Bio::NCBI.default_email = "me@me.com"
gb = Bio::NCBI::REST::EFetch.nucleotide("JN556047")

puts gb
# showing first few lines only
# LOCUS       JN556047                 164 bp    DNA     linear   INV 19-OCT-2011
# DEFINITION  Apis cerana isolate A0101 cGMP-dependent protein kinase foraging
#             (For) gene, exon 3 and partial cds.
# ACCESSION   JN556047
# VERSION     JN556047.1  GI:351634776
# KEYWORDS    .
# SOURCE      Apis cerana (Asiatic honeybee)
#   ORGANISM  Apis cerana
#             Eukaryota; Metazoa; Arthropoda; Hexapoda; Insecta; Pterygota;
#             Neoptera; Endopterygota; Hymenoptera; Apocrita; Aculeata; Apoidea;
#             Apidae; Apis.
# ....
0
12.7 years ago
Yannick Wurm ★ 2.5k

Here's a variant on what Neil replied, inspired by this question.

#!/usr/bin/env ruby
require 'rubygems'
require 'bio'

Bio::NCBI.default_email = "xxxxx@qmul.ac.uk"
ncbi        = Bio::NCBI::REST.new
sequenceIDs = ("JP773711".."JP820231").to_a
sequences   = ncbi.efetch(sequenceIDs,
                          {"db"=>"nuccore",
                           "rettype"=>"fasta",
                           "retmax"=> 10000000})

# ncbi returns a single big string with records separated by two newlines - we just want one.
sequences.gsub!("\n\n", "\n")

File.open('Nylanderia_pubens_ests.fasta', 'w') {|f| f.write(sequences +"\n") }
puts "done."
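
The script above relies on a Ruby string range to enumerate accessions. A string range keeps going past the last number of a prefix (for example, "JP999999".succ is "JQ000000"), so a slightly more defensive way to expand a range might look like the sketch below. The method name accession_range and its validation rules are assumptions for illustration, not part of BioRuby.

```ruby
# Pattern assumed here: an uppercase letter prefix followed by digits.
ACCESSION_PATTERN = /\A([A-Z]+)(\d+)\z/

# Expand an inclusive accession range such as JP773711..JP773713
# (hypothetical helper; raises if the two ends do not belong together).
def accession_range(first, last)
  m1 = ACCESSION_PATTERN.match(first)
  m2 = ACCESSION_PATTERN.match(last)
  raise ArgumentError, 'not an accession number' unless m1 && m2
  raise ArgumentError, 'prefixes differ' unless m1[1] == m2[1]
  raise ArgumentError, 'digit counts differ' unless m1[2].length == m2[2].length
  raise ArgumentError, 'range is reversed' if m1[2].to_i > m2[2].to_i

  width = m1[2].length
  # Re-pad each number so leading zeros are preserved.
  (m1[2].to_i..m2[2].to_i).map { |n| m1[1] + n.to_s.rjust(width, '0') }
end
```

The resulting array can be passed to ncbi.efetch in place of the string range.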
0
12.7 years ago
Priyabrata ▴ 70

You can use the E-utilities at NCBI (http://www.ncbi.nlm.nih.gov/books/NBK25500/). They can be called from any programming language to search for and retrieve records.
