Obtaining NCBI GI Numbers from Taxonomy ID (for Entrez efetch query)
9.8 years ago
patroos ▴ 70

I'm trying to automatically download FASTA files from the NCBI nucleotide database for a list of taxonomy IDs. I know I can use Entrez's efetch, but it expects GI numbers, which I don't have. Is there a way to fetch by taxonomy ID, or a straightforward, non-manual way to get GI numbers from taxonomy IDs?

Thanks much for any help!

nucleotide Entrez NCBI • 7.0k views
9.8 years ago
5heikki 11k

This is really simple with Entrez Direct:

epost -db taxonomy -id 63221 | elink -target nuccore | efetch -format uid
196123578
677001457
634744538
634744524
2286205
584458899
513134556
398637208
315623200
270209679
270209678
270209677
262527002
253947345
253947331
253947317
253947303
253947289
222350099
222350097
195972535
158958247
158251955
158251954
111035029
91075865
28557455
11141613
11141612
7769684
4927255

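This also works with -target nucgss if that's the database you're interested in. You can also skip the GI step entirely and get the sequences directly by ending the pipeline with efetch -format fasta instead of efetch -format uid.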
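Since the question is about a whole list of taxonomy IDs, here is a minimal sketch of looping the same pipeline over a file. The function name fetch_fasta_for_taxids and the one-ID-per-line input file are assumptions for illustration, not part of Entrez Direct. Redirecting epost's stdin from /dev/null is a precaution so that it cannot consume the list the loop is reading.

```shell
# Hypothetical helper: fetch one FASTA file per taxonomy ID.
# $1 = text file with one taxonomy ID per line (assumed layout)
# $2 = output directory (defaults to the current directory)
fetch_fasta_for_taxids() {
    list="$1"
    outdir="${2:-.}"
    mkdir -p "$outdir"
    while IFS= read -r taxid; do
        # skip blank lines in the list
        [ -n "$taxid" ] || continue
        # same pipeline as above, ending in FASTA instead of UIDs
        epost -db taxonomy -id "$taxid" </dev/null |
            elink -target nuccore |
            efetch -format fasta > "$outdir/taxid_${taxid}.fasta"
    done < "$list"
}
```

Usage would look something like fetch_fasta_for_taxids taxids.txt fasta_out, which should leave files such as fasta_out/taxid_63221.fasta. For large taxa the result sets can be big, so it is probably worth trying a single small taxon first.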


Fantastic, this is very useful. Thanks!

9.8 years ago
patroos ▴ 70

I haven't found a great solution for this, but the workaround I'm using now is to download /pub/taxonomy/gi_taxid_nucl.dmp.gz from the NCBI FTP server. The file maps GI numbers to taxonomy IDs, and I search it to get the GI numbers for a given taxonomy ID.
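The search step could be sketched roughly like this. It assumes the dump is a two-column, tab-separated file with the GI number first and the taxonomy ID second; gis_for_taxid is a made-up helper name.

```shell
# Hypothetical helper: print every GI number mapped to one taxonomy ID.
# $1 = taxonomy ID
# $2 = path to the dump (plain or gzip-compressed)
# Assumes two tab-separated columns per line: GI, then taxonomy ID.
gis_for_taxid() {
    taxid="$1"
    dump="$2"
    # decompress on the fly if the file is still gzipped
    case "$dump" in
        *.gz) gzip -dc "$dump" ;;
        *)    cat "$dump" ;;
    esac | awk -F '\t' -v t="$taxid" '$2 == t { print $1 }'
}
```

For example, gis_for_taxid 63221 gi_taxid_nucl.dmp.gz > gis_63221.txt. This is a single linear scan of the file per call, so for many taxonomy IDs it may be better to filter for all of them in one pass.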

9.8 years ago
David W 4.9k

The elink util will turn up cross-references between NCBI databases.

In this case you want to find links between the taxonomy and nucleotide databases. Here's a demo using the R package rentrez (there are similar libraries for pretty much all popular scripting languages, and even command-line utilities for this):

library(rentrez)

## find sequences linked to a taxid
tax_seqs <- entrez_link(db = "nuccore", dbfrom = "taxonomy", id=5911)
#elink result with ids from 2 databases:
#[1] taxonomy_nuccore        taxonomy_nucleotide_exp

## grab them

tmp <- tempfile()
recs <- entrez_fetch(db="nuccore", id=tax_seqs$taxonomy_nuccore[1:3], rettype="fasta")
cat(recs, file=tmp)
ape::read.dna(tmp, format="fasta")
#3 DNA sequences in binary format stored in a list.
#
# Mean sequence length: 1493
# Shortest sequence: 1137
# Longest sequence: 1779
#
# Labels:
# gi|697738807|gb|KM406498.1| Tetrahymena thermophila Pat2 mRNA, complete cds
# gi|697738801|gb|KM406497.1| Tetrahymena thermophila Tpt1 mRNA, complete cds
# gi|697738796|gb|KM406496.1| Tetrahymena thermophila Pat1 mRNA, complete cds
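
For anyone who wants the same elink lookup without R, here is a rough sketch against the E-utilities elink endpoint using plain shell tools. The helper names elink_url and linked_ids are invented for this example, and the parser assumes the default XML output, where each linked record appears as an Id element inside a Link element. It relies on grep -o, which GNU and BSD grep both provide.

```shell
# Hypothetical helper: build the elink request URL for one taxonomy ID,
# restricted to the taxonomy_nuccore link set.
elink_url() {
    printf 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi?dbfrom=taxonomy&db=nuccore&linkname=taxonomy_nuccore&id=%s\n' "$1"
}

# Hypothetical helper: read elink XML on stdin and print the linked IDs,
# one per line. Only <Id> elements directly inside <Link> are kept, so the
# query ID listed under <IdList> is ignored.
linked_ids() {
    tr -d '\n' |
        grep -o '<Link>[^<]*<Id>[0-9]*</Id>' |
        sed 's/.*<Id>\([0-9]*\)<\/Id>/\1/'
}

# Example (needs network access and curl):
#   curl -s "$(elink_url 5911)" | linked_ids
```

A proper XML parser would be more robust than grep and sed here; this is only meant to show what the link lookup boils down to.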
9.5 years ago
natasha.sernova ★ 4.0k

It's much simpler: see the BioPerl EUtilities Cookbook at http://www.bioperl.org/wiki/HOWTO:EUtilities_Cookbook.

Good luck!
