Entrez/E-utilities or Blast? NCBI gi numbers to RefSeq accession
4
0
Entering edit mode
8.0 years ago
mdrnao • 0

I have a list of about 7000 NCBI gi numbers from one program and I wish to take these into the next program, but this requires the refseq accession numbers i.e. those which start NC_*.

e.g. of what I have: `154350369

154350369 
594021901
811154183
407962962
407955691
218540569

What is the best approach?! I have a previous file with the sequence attached to the GI numbers but using blast I only managed to get back the same NCBI accessions (may have been user error?).

Using the following code it is easy to go from GI number to genbank accession (stolen from the docs), but again, still in the genbank format not refseq. I think I have to use the eLink feature and then use eSearch with the search term srcdb_refseq[prop] for the right linked file? (not that I'm sure how to do thins) or would blast be easier? (if I figured how to use the command line version!)

use LWP::Simple;
$gi_list = '154350369, 594021901, 811154183, 407962962';

#assemble the URL
$base = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/';
$url = $base . "efetch.fcgi?db=nucleotide&id=$gi_list&rettype=acc";

#post the URL
$output = get($url);
print "$output";

Very confused and a sanity check needed! Thank you!

genome • 4.2k views
ADD COMMENT
1
Entering edit mode
8.0 years ago
mdrnao • 0

OK solved the problem!! Using the Genbank accession numbers, I use the below script to search NCBI but specifying the RefSeq database, returning the RefSeq accession! It's not perfect as some do not create a match but I'll take it! I just use bash to loop around the files in my directory and direct the prints to a new file.

I do, however, get an error message (below) and I'm not sure why! But it works! Thanks to everyone for your suggestions, it helped point me in the right direction.

Use of uninitialized value in print at databaseQuery.pl line 25.

#!/usr/local/bin/perl 
use Bio::DB::EUtilities;
use strict;
use warnings;

my $line = ();
my $filename = $ARGV[0];
open(my $fh ,'<' ,$filename ) or die "could not open file $!";

foreach $line (<$fh>)
        {
        chomp $line;
        my $factory = Bio::DB::EUtilities->new(-eutil => 'esearch',
                                       -db     => 'nuccore',
                                       -term   => "$line AND srcdb_refseq[prop]",
                                       -email  => 'mymail@foo.bar',
                                       -retmax => 5000);

# query terms are mapped; what's the actual query?
print "Query translation: ",$factory->get_query_translation,"\n";

# query hits
print "Count = ",$factory->get_count,"\n";

# UIDs
my @ids = $factory->get_ids;
print "> @ids \n";
}
print "\n";
ADD COMMENT
1
Entering edit mode
8.0 years ago
Sej Modha 5.3k

NCBI Unix e-utils equivalent would be

esearch -db nucleotide -query "9507198" |efetch -format acc
ADD COMMENT
0
Entering edit mode

Thanks for your help! Unfortunately that doesn't work using my GI numbers as they are the genbank ones and not the refseq gi numbers. Any idea if there's a linked file for the refseq accessions?

ADD REPLY
0
Entering edit mode

Following query returns NM_019353.1 for me.

esearch -db nucleotide -query "9507198" |efetch -format acc
ADD REPLY
0
Entering edit mode

please try one of the GI numbers 154350369, 59402190, 811154183. These are the example of the GI numbers I have :). My mistake though, I script was taken from the help pages for the utilities and I forgot to substitute their GI numbers for mine to save confusion! Sorry!

ADD REPLY
0
Entering edit mode

You can have a look at those GIs on the NCBI and check if a RefSeq accession exists for them.

ADD REPLY
0
Entering edit mode

Also I am not sure if refseq GI number really exists, as my understanding is that refseq only has an accession number attached to it as opposed to a GI. Depending on time BLAST run takes, you could re-run BLAST and extract accession number directly using blast+ commands. For BLAST+ you could use -outfmt '6 qseqid qlen sseqid sacc' where sacc is subject accession number.

ADD REPLY
0
Entering edit mode

As an example:

Streptococcus mutans UA159 chromosome, complete genome 2,032,925 bp circular DNA Accession: NC_004350.2 GI: 347750429

Streptococcus mutans UA159, complete genome 2,032,925 bp circular DNA Accession: AE014133.2 GI: 345287734`

That is eventually where I need to end up. but how to I get the first GI number, if I start with the second GI number?

ADD REPLY
0
Entering edit mode
esearch -db nucleotide -query "347750429" |efetch -format acc

returns NC_004350.2

ADD REPLY
0
Entering edit mode

I don't think you can if they are not linked any way.

ADD REPLY
0
Entering edit mode

Sorry for editing my last answer -I had ran out of posts for the next 6 hours! OK - I think what I'll have to do is retrieve the taxomic name, then search for the name but specify in the search terms "srcdb_refseq[prop]" which is the refseq database? It is a shame they have no linking information! Thank you for your help.

ADD REPLY
0
Entering edit mode
8.0 years ago
Diwan ▴ 650

Eutils option will be helpful for this. For example, we can use the Rentrez package in R and do the following

library(rentrez)
GI_match <- entrez_summary(db = "nucleotide", id = "9507198")
GI_match$caption

we get "NM_019353" as output. HTH

ADD COMMENT
0
Entering edit mode

That package is great! But unfortunately using my GI numbers (which are genbank GI numbers) I still get back the genbank accession number. I'll try to hunt down an additional database, maybe I can link the two together? My first language was R so maybe I'll have more luck here! Thank you for your help!

ADD REPLY
0
Entering edit mode
 get_acc <- function(ab){
        GI_match <- entrez_summary(db = "nucleotide", id = ab)
        GI_match$extra
    }

GI_query <- c("154350369","594021901","811154183","407962962")
OUT <- sapply(GI_query,get_acc)

OUT[1]

"gi|154350369|gb|CP000560.1|"

It seems to be working for your query too, unless you are looking for some other id.

ADD REPLY
0
Entering edit mode

CPaccessions are the genbank accessions - I am after the refseq accessions i.e. NC_. As an example:

Streptococcus mutans UA159 chromosome, complete genome 2,032,925 bp circular DNA Accession: NC_004350.2 GI: 347750429

Streptococcus mutans UA159, complete genome 2,032,925 bp circular DNA Accession: AE014133.2 GI: 345287734

I have the second, I want the first! I think it might be impossible, so I have been trying to extract the genomic name then use that with esearch.

ADD REPLY
0
Entering edit mode
8.0 years ago

The BBMap package has a nifty tool for things like this, gi2taxid.sh. It only works on fasta files, though. You can use it like this:

gi2taxid.sh in=x.fa out=y.fa gi=gitable.int1d.gz

That's after downloading gi_taxid_nucl.dmp.gz and gi_taxid_prot.dmp.gz from NCBI, and converting them like this:

gitable.sh gi_taxid_nucl.dmp.gz,gi_taxid_prot.dmp.gz gitable.int1d.gz

Sorry it's a pain, but NCBI is not a very responsive institution, so their file formats are not very useful. If pure GI number to NCBI number (without fasta files) conversion is crucial, let me know, and I'll add a tool that does that.

ADD COMMENT
0
Entering edit mode

Thanks for this, however I am missing a step still to switch from genbank to Refseq! Unfortunately the GI numbers I have are genbank, so I need to find a way to get the linking information. Surely somewhere on NCBI there is a file with both accession numbers!? I need to go from one database to another

ADD REPLY

Login before adding your answer.

Traffic: 2831 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6