Hi!
For some research we are doing, we would like to build a small database (e.g. in SQLite3) storing just one table with three columns. The first would hold the GI number of every sequence in the current NT database. You can get these easily with this command:
blastdbcmd -db nt -entry all -outfmt '%g' > all_gis.txt
The second column would contain the "isolation_source" entry for that sequence, if it has one in its record; otherwise we can omit that row from the database.
The last column should contain the PubMed ID of the publication associated with that sequence, if it has one.
This isn't too hard to do, and I have written a script that does exactly that by querying NCBI through the E-utilities with Biopython:
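In outline it does something like this (a simplified sketch rather than the actual script; for brevity it parses the GenBank flat format with SeqIO instead of the XML, and the email address, file names, and table layout are placeholders):

import sqlite3
from Bio import Entrez, SeqIO

Entrez.email = "you@example.org"  # placeholder; NCBI requires a contact address

conn = sqlite3.connect("nt_metadata.db")
conn.execute("CREATE TABLE IF NOT EXISTS metadata "
             "(gi TEXT PRIMARY KEY, isolation_source TEXT, pubmed_id TEXT)")

with open("all_gis.txt") as gis:
    for gi in (line.strip() for line in gis):
        # One network round trip per GI: fetch the whole record, keep two fields.
        handle = Entrez.efetch(db="nucleotide", id=gi, rettype="gb", retmode="text")
        record = SeqIO.read(handle, "genbank")
        handle.close()

        # isolation_source, if present, is a qualifier on the source feature.
        isolation_source = None
        for feature in record.features:
            if feature.type == "source":
                isolation_source = feature.qualifiers.get("isolation_source", [None])[0]
                break
        if isolation_source is None:
            continue  # no isolation_source: omit this row

        # First PubMed ID among the record's references, if any.
        pubmed_id = None
        for ref in record.annotations.get("references", []):
            if ref.pubmed_id:
                pubmed_id = ref.pubmed_id
                break

        conn.execute("INSERT OR REPLACE INTO metadata VALUES (?, ?, ?)",
                     (gi, isolation_source, pubmed_id))
        conn.commit()

conn.close()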
The problem is that, as it is running right now, the current estimate for the finish time is 170+ hours. I would like to get some results faster... Do you know of any way to optimize this process? Maybe by changing the queries that are sent to NCBI? Currently the script queries NCBI for a particular GI number and receives the whole XML entry back. Is there a way to formulate a query that returns only the two fields that interest us, isolation_source and pubmed_id? It's quite frustrating to only be able to access this huge and very useful database over the web with custom, archaic utils like "efetch" etc.
Thanks!
Why don't you (1) download all the sequences from GenBank (ftp://ftp.ncbi.nlm.nih.gov/genbank/) as GenBank records (flat files) and then (2) loop over each record in Python, extracting the isolation source and PubMed ID?
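Step (2) could look something like this (a rough sketch; it assumes the downloaded gzipped flat files, gb*.seq.gz, sit in the current directory, and reuses the table layout from your question):

import glob
import gzip
import sqlite3
from Bio import SeqIO

conn = sqlite3.connect("nt_metadata.db")
conn.execute("CREATE TABLE IF NOT EXISTS metadata "
             "(gi TEXT PRIMARY KEY, isolation_source TEXT, pubmed_id TEXT)")

for path in glob.glob("gb*.seq.gz"):
    with gzip.open(path, "rt") as handle:
        for record in SeqIO.parse(handle, "genbank"):
            # isolation_source, if present, is a qualifier on the source feature.
            isolation_source = None
            for feature in record.features:
                if feature.type == "source":
                    isolation_source = feature.qualifiers.get("isolation_source", [None])[0]
                    break
            if isolation_source is None:
                continue  # omit rows without an isolation_source

            # First PubMed ID among the record's references, if any.
            pubmed_id = next((ref.pubmed_id
                              for ref in record.annotations.get("references", [])
                              if ref.pubmed_id), None)

            # Records that carry a GI expose it in the annotations.
            gi = record.annotations.get("gi")
            if gi:
                conn.execute("INSERT OR REPLACE INTO metadata VALUES (?, ?, ?)",
                             (gi, isolation_source, pubmed_id))
    conn.commit()

conn.close()

That removes the per-record network round trip entirely, so a single pass over the local files should be far faster than 170+ hours of E-utilities queries.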
Hey, you're right. That would actually solve this whole issue. So ALL the GenBank entries are stored in those archives? That's going to take up a lot of hard drive space, but it's probably doable. Thanks.