Question

NCBI eutils: retrieve all isolation sources and pubmed IDs

9.3 years ago

Xapple ▴ 30

Hi!

For some research we are doing, we would like to build a small database (for example in SQLite3) containing a single table with three columns. The first column would hold the GI number of every sequence in the current NT database. These are easy to get with this command:

     blastdbcmd -db nt -entry all -outfmt '%g' > all_gis.txt

The second column would contain the "isolation_source" entry for that sequence, if its record has one; otherwise we can omit that row from the database.

The last column should contain the pubmed ID of the publication associated with that sequence, if it has one.
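To make the layout concrete, here is a minimal sketch of that table using Python's built-in `sqlite3` module. The table name `gi_sources`, the column names, the helper `build_db`, and the demo rows are all placeholders chosen for illustration, not real GenBank data. Rows without an isolation source are skipped, as described above.

```python
import sqlite3

def build_db(rows, path=":memory:"):
    """Create the three-column table and load (gi, isolation_source, pubmed_id) rows."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS gi_sources ("
        " gi INTEGER PRIMARY KEY,"
        " isolation_source TEXT NOT NULL,"
        " pubmed_id INTEGER)"  # NULL when the record has no associated publication
    )
    # Omit rows whose record has no isolation_source.
    kept = [(gi, src, pmid) for gi, src, pmid in rows if src]
    conn.executemany("INSERT OR REPLACE INTO gi_sources VALUES (?, ?, ?)", kept)
    conn.commit()
    return conn

# Made-up placeholder values, only to show the shape of the data.
demo_rows = [
    (1001, "hot spring sediment", 12345678),
    (1002, None, 23456789),   # no isolation source: not stored
    (1003, "human gut", None),  # no PubMed ID: stored with NULL
]
conn = build_db(demo_rows)
stored = conn.execute(
    "SELECT gi, isolation_source, pubmed_id FROM gi_sources ORDER BY gi"
).fetchall()
```

Passing a file path instead of `":memory:"` would write the database to disk.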

This isn't too hard to do, and I have written a script that does exactly that by querying NCBI through the eutils with Biopython:

The problem is that, as it is running right now, the estimated time to finish is 170+ hours. I would like to get some results faster. Do you know of any way to optimize this process? Maybe by changing the queries that are sent to NCBI? Currently the script queries NCBI for a particular GI number and receives the whole XML entry back. Is there a way to formulate a query that returns only the two fields that interest us, isolation_source and pubmed_id? It's quite frustrating to only be able to access this huge and very useful database over the web with custom, archaic utilities like "efetch".

Thanks!

ncbi eutils python • 2.6k views

updated 2.2 years ago by Ram 44k • written 9.3 years ago by Xapple ▴ 30

Why don't you (1) download all sequences from GenBank (ftp://ftp.ncbi.nlm.nih.gov/genbank/) in GenBank flat-file format and then (2) loop over each record using Python and extract the isolation source and PubMed ID?
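A minimal sketch of step (2), using only the standard library and assuming the usual flat-file layout: a `VERSION` line, `PUBMED` lines inside `REFERENCE` blocks, an `/isolation_source="..."` qualifier that may wrap onto following lines, and `//` closing each record. The names `Record` and `parse_flatfile` and the sample text are made up for illustration. Records are keyed on the accession.version from the `VERSION` line, so mapping back to GI numbers is left out here.

```python
import io
import re
from typing import Iterator, List, NamedTuple, Optional, TextIO

class Record(NamedTuple):
    version: str                     # accession.version from the VERSION line
    isolation_source: Optional[str]  # None when the qualifier is absent
    pubmed_ids: List[str]

PUBMED_RE = re.compile(r"^\s{1,4}PUBMED\s+(\d+)")
QUAL = '/isolation_source="'

def parse_flatfile(handle: TextIO) -> Iterator[Record]:
    """Stream records from a GenBank flat file without loading it all into memory."""
    version, source, pubmed, pending = None, None, [], None
    for raw in handle:
        line = raw.rstrip("\n")
        stripped = line.strip()
        if pending is not None:
            # Continuation of a qualifier value that wrapped onto this line.
            pending += " " + stripped
            if stripped.endswith('"'):
                source, pending = pending[:-1], None
            continue
        match = PUBMED_RE.match(line)
        if line.startswith("VERSION"):
            parts = line.split()
            version = parts[1] if len(parts) > 1 else None
        elif match:
            pubmed.append(match.group(1))
        elif stripped.startswith(QUAL):
            value = stripped[len(QUAL):]
            if value.endswith('"'):
                source = value[:-1]
            else:
                pending = value
        elif line.startswith("//"):
            # End of record: emit it and reset the per-record state.
            if version is not None:
                yield Record(version, source, pubmed)
            version, source, pubmed = None, None, []

# Made-up sample in flat-file layout, not real GenBank entries.
SAMPLE = """\
LOCUS       XX000001                  12 bp    DNA     linear   ENV 01-JAN-2000
VERSION     XX000001.1
REFERENCE   1  (bases 1 to 12)
  AUTHORS   Doe,J.
   PUBMED   12345678
FEATURES             Location/Qualifiers
     source          1..12
                     /organism="uncultured bacterium"
                     /isolation_source="hot spring
                     sediment"
ORIGIN
        1 acgtacgtac gt
//
LOCUS       XX000002                  12 bp    DNA     linear   ENV 01-JAN-2000
VERSION     XX000002.1
FEATURES             Location/Qualifiers
     source          1..12
                     /organism="uncultured bacterium"
ORIGIN
        1 acgtacgtac gt
//
"""
records = list(parse_flatfile(io.StringIO(SAMPLE)))
```

For real data, Biopython's `SeqIO.parse(handle, "genbank")` is likely the more robust choice, since it handles the format's edge cases; the line-based version above only shows that the two fields can be pulled out in a single streaming pass.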

9.3 years ago by Andrzej Zielezinski 11k

Hey, you're right. That would actually solve this whole issue. So ALL the GenBank entries are stored in those archives? That's going to take up a lot of space on the hard drive, but it's probably doable. Thanks.

updated 2.2 years ago by Ram 44k • written 9.2 years ago by Xapple ▴ 30