Question

Programatically fetch all GI numbers for nucleotide sequences of a species from NCBI?

0

Entering edit mode

9.1 years ago

biocyberman ▴ 870

I want to make an blastdb alias with blastdb_aliastool for Homo sapiens. I can go online to search and export a GI list and use it with the aliastool. However, I will have to repeat the process again after some time. Is there away I can fetch GI numbers from a command line or via Python/Biopython, R, and if not avoidable, Perl.

Update

After reading elsewhere on biostars and following the instruction Here to install edirect command line apps, I did this:

epost -db taxonomy -id 9606|elink -batch -target nuccore |efetch -format uid >hsapiens.gi.txt

The output hsapiens.gi.txt contains only 100000 records, this is the maximum retrieval for one batch. Searching further about E-Utilities, I think I have to use `retstart` paramenter somewhere in the command line above. But how?

ncbi blast biopython edirect • 3.4k views

ADD COMMENT • link updated 9.1 years ago by 5heikki 11k • written 9.1 years ago by biocyberman ▴ 870

2

Entering edit mode

Im sure it's possible to bruteforce/scrape your way to all the GIs, or programatically via biomart maybe, but my experience of doing this against the NCBI's websites is to not bother, because whatever you write to do the scraping now will not work in 6 months. You are better off just writing a nice e-mail to the NCBI and asking for the data. They will reply before lunch time, maybe sooner if its all just sitting in an Excel file anyway from the last person that asked, and they may even tell you about API/SQL access you could use to keep your lists as up-to-date as possible. This goes for pretty much anything where you want a large amount of data once, then not much ever again.

ADD REPLY • link 9.1 years ago by John 13k

1

Entering edit mode

Thanks for the suggestion. I will contact NCBI if the question can't be answered soon here. I chose to post here because the question may benefit others as well.

ADD REPLY • link 9.1 years ago by biocyberman ▴ 870

2

Entering edit mode

9.1 years ago

GenoMax 147k

Get a copy of this file: ftp://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2accession.gz

Column 1 contains the taxID's. You can get all kinds of other ID's once your parse the relevant lines out.

File above is regenerated regularly so should be current at all times.

ADD COMMENT • link 9.1 years ago by GenoMax 147k

0

Entering edit mode

The gi numbers in NCBI are asigned to non-gene sequences, so this approach is not applicable for my case, but it would be very useful in other cases.

ADD REPLY • link updated 5.0 years ago by Ram 44k • written 9.1 years ago by biocyberman ▴ 870

Ram · Accepted Answer · 2015-10-21

2

Entering edit mode

9.1 years ago

5heikki 11k

You can extract the GIs straight from your db as instructed here

blastdbcmd -db nr -entry all -outfmt "%g %T" | \
   awk ' { if ($2 == 9606) { print $1 } } ' | \
   blastdbcmd -db nr -entry_batch - -out human_sequences.txt

In the above example you get the seqs, but I don't see why you couldn't pipe the GIs to blastdb_aliastool the same way..

ADD COMMENT • link updated 5.0 years ago by Ram 44k • written 9.1 years ago by 5heikki 11k

0

Entering edit mode

Use blastdbcmd to dump the data out is a reliable way, but also not the fastest way. Anyway, it is a reliable and working way. I came up with this command to get GI numbers I want and use them for blastdb_aliastool command.

blastdbcmd -db nt -entry all -outfmt "%g %T"|gawk '( $2 == 9606 ) {count+=1; print $1 >hs.gi.txt}END{print "Found: "count" records"}'

ADD REPLY • link updated 5.0 years ago by Ram 44k • written 9.1 years ago by biocyberman ▴ 870