I have a list of about 7000 NCBI GI numbers from one program that I want to feed into the next program, but it requires RefSeq accession numbers, i.e. those starting with NC_.
Example of what I have:
154350369
154350369
594021901
811154183
407962962
407955691
218540569
What is the best approach? I have an earlier file with the sequences attached to the GI numbers, but using BLAST I only managed to get back the same NCBI accessions (which may have been user error).
Using the following code (taken from the docs) it is easy to go from GI number to GenBank accession, but the result is still a GenBank accession, not a RefSeq one. I think I have to use the ELink utility and then ESearch with the search term srcdb_refseq[prop] to find the right linked record (not that I'm sure how to do this)? Or would BLAST be easier (if I could figure out how to use the command-line version)?
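use strict;
use warnings;
use LWP::Simple;   # HTTPS URLs also need LWP::Protocol::https installed
# Comma-separated GI numbers (no spaces, so the URL stays valid)
my $gi_list = '154350369,594021901,811154183,407962962';
# Assemble the URL
my $base = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/';
my $url  = $base . "efetch.fcgi?db=nucleotide&id=$gi_list&rettype=acc";
# Fetch the URL (an HTTP GET) and print the returned accessions
my $output = get($url);
die "Request failed: $url\n" unless defined $output;
print $output;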
Very confused and a sanity check needed! Thank you!
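With about 7000 GI numbers, a single efetch URL would probably be too long, so the list likely needs to be de-duplicated and split into batches. The following is only a sketch of that idea: the helper names `unique_gis` and `efetch_urls` are made up for illustration, and the batch size of 3 is just to keep the demo short (NCBI's guidance is to switch to HTTP POST once a request carries more than roughly 200 IDs).

```perl
use strict;
use warnings;

my $base = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/';

# Hypothetical helper: drop duplicates (keeping first-seen order)
# and skip anything that is not a plain number.
sub unique_gis {
    my %seen;
    return grep { /^\d+$/ && !$seen{$_}++ } @_;
}

# Hypothetical helper: split the IDs into batches of at most $size
# and build one efetch URL per batch.
sub efetch_urls {
    my ($size, @gis) = @_;
    my @urls;
    while (my @batch = splice(@gis, 0, $size)) {
        push @urls, $base . 'efetch.fcgi?db=nucleotide&rettype=acc&id='
                  . join(',', @batch);
    }
    return @urls;
}

# The example GI numbers from the question (one of them appears twice)
my @gis = unique_gis(qw(154350369 154350369 594021901 811154183
                        407962962 407955691 218540569));
my @urls = efetch_urls(3, @gis);   # tiny batch size, for the demo only
print "$_\n" for @urls;
```

Each URL could then be passed to `get()` from LWP::Simple as in the script above, with a short pause between requests to stay within NCBI's rate limits. This still returns GenBank accessions; it only addresses the size of the list.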
Thanks for your help! Unfortunately that doesn't work with my GI numbers, as they are GenBank GI numbers and not RefSeq ones. Any idea if there's a linked record for the RefSeq accessions?
Following query returns
NM_019353.1
for me.
Please try one of the GI numbers 154350369, 59402190, 811154183. These are examples of the GI numbers I have :). My mistake though: the script was taken from the help pages for the utilities and I forgot to substitute their GI numbers with mine to save confusion! Sorry!
You can have a look at those GIs on the NCBI and check if a RefSeq accession exists for them.
Also, I am not sure a RefSeq GI number really exists; my understanding is that a RefSeq record only has an accession number attached to it, as opposed to a GI. Depending on how long the BLAST run takes, you could re-run BLAST and extract the accession numbers directly using BLAST+ commands. For BLAST+ you could use
-outfmt '6 qseqid qlen sseqid sacc'
where sacc is the subject accession number. As an example:
That is eventually where I need to end up. But how do I get the first GI number if I start with the second GI number?
returns
NC_004350.2
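For the BLAST+ route, the tabular output produced with `-outfmt '6 qseqid qlen sseqid sacc'` could be filtered so that only hits whose subject accession starts with NC_ are kept. This is a rough sketch rather than a tested pipeline: `refseq_hits` is a made-up helper name, and the rows below are invented placeholder data, not real BLAST results.

```perl
use strict;
use warnings;

# Hypothetical helper: given tab-separated lines in
# '6 qseqid qlen sseqid sacc' order, return a hash mapping each
# query ID to the first subject accession that starts with NC_.
sub refseq_hits {
    my %first;
    for my $line (@_) {
        my $row = $line;
        chomp $row;
        next if $row =~ /^\s*(?:#|$)/;            # skip comments and blank lines
        my ($qseqid, $qlen, $sseqid, $sacc) = split /\t/, $row;
        next unless defined $sacc && $sacc =~ /^NC_/;
        $first{$qseqid} = $sacc unless exists $first{$qseqid};
    }
    return %first;
}

# Invented rows for illustration only (placeholder IDs and accessions)
my @rows = (
    "query1\t1200\tgb|AB000000.1|\tAB000000\n",
    "query1\t1200\tref|NC_000000.1|\tNC_000000\n",
    "query2\t900\tgb|AB000001.1|\tAB000001\n",
);

my %hits = refseq_hits(@rows);
print "$_\t$hits{$_}\n" for sort keys %hits;
```

In practice the lines would come from the BLAST output file, e.g. by opening it and passing `<$fh>` to the helper. Queries with no NC_ hit (like `query2` here) are simply absent from the result, which makes it easy to see which ones still need another approach.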
I don't think you can if they are not linked in any way.
Sorry for editing my last answer (I had run out of posts for the next 6 hours)! OK, I think what I'll have to do is retrieve the taxonomic name, then search for that name but specify "srcdb_refseq[prop]" in the search terms, which restricts the search to the RefSeq database? It is a shame they have no linking information! Thank you for your help.
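The name-based search described above could be sketched roughly as follows: build an ESearch URL whose term combines the organism name with the `srcdb_refseq[prop]` filter. The helper names `url_encode` and `refseq_search_url`, the organism `Escherichia coli`, and the `retmax` value of 20 are all placeholders chosen for illustration; the real names would come from the taxonomy lookup for each GI.

```perl
use strict;
use warnings;

my $base = 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/';

# Hypothetical helper: percent-encode everything except unreserved
# URL characters (adequate for plain ASCII organism names).
sub url_encode {
    my ($text) = @_;
    $text =~ s/([^A-Za-z0-9\-_.~])/sprintf('%%%02X', ord($1))/ge;
    return $text;
}

# Hypothetical helper: ESearch URL restricted to RefSeq records
# for one organism name.
sub refseq_search_url {
    my ($organism, $retmax) = @_;
    my $term = '"' . $organism . '"[Organism] AND srcdb_refseq[prop]';
    return $base . 'esearch.fcgi?db=nucleotide&retmax=' . $retmax
                 . '&term=' . url_encode($term);
}

# Placeholder organism name, for the demo only
my $url = refseq_search_url('Escherichia coli', 20);
print "$url\n";
```

ESearch returns a list of matching IDs as XML, which could then be passed to efetch with `rettype=acc` as in the original script. A search by organism name alone may return many RefSeq records (plasmids, multiple strains), so the results would probably still need checking against the original sequences.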