Question

Find All Genbank Submissions Associated With An Article Publication In 2005

14.2 years ago

Heather Piwowar ▴ 380

Is there an easy way to query Genbank for all sequence submissions associated with ANY journal article published in 2005? So, all sequences that have a JOURNAL metadata line that includes "2005" (and ideally a real journal name and/or PMID) but without a related TITLE of Direct Submission. I'm interested in datasets associated with ANY AND ALL 2005 publications.

Using a filter that includes "2005"[pdat] will limit the responses to sequences that were deposited in 2005, but this isn't quite what I want.

An example of what I want to find: http://www.ncbi.nlm.nih.gov/nuccore/AY843753.1

And what I don't want to find: http://www.ncbi.nlm.nih.gov/nuccore/CP000223.3, http://www.ncbi.nlm.nih.gov/nuccore/CH478173.1, http://www.ncbi.nlm.nih.gov/nuccore/AY944235.1

Can metadata in the REFERENCES section be filtered using the NCBI web interface? Or any other web interface? Or can I do some creative PMID linking and filtering thing? Or am I best off downloading the metadata (using eutils?) and filtering it with scripts offline?

Thanks!

genbank retrieval eutils • 6.0k views

ADD COMMENT • link updated 11.8 years ago by cdsouthan ★ 1.9k • written 14.2 years ago by Heather Piwowar ▴ 380

score 5 · Answer 1 · 2011-02-16

14.2 years ago

Lars Juhl Jensen 11k

As far as I can see, you can do this by querying the [Journal] and [Publication Date] fields of the NCBI Nucleotide database, either via the Entrez web interface or through eutils esearch (and then fetching the actual entries by their primary IDs):

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=nucleotide&term=(((Bull.%20Am.%20Mus.%20Nat.%20Hist.[Journal])%20AND%202005[Publication%20Date]))
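The same kind of query can be assembled programmatically, which avoids hand-escaping the search term. This is only a minimal sketch: the helper name `esearch_url` is made up for illustration, the `retstart`/`retmax` defaults are arbitrary, and it builds the URL without sending any request.

```python
from urllib.parse import urlencode

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def esearch_url(term, db="nucleotide", retstart=0, retmax=500):
    """Build an ESearch URL (hypothetical helper; nothing is fetched here)."""
    params = {"db": db, "term": term, "retstart": retstart, "retmax": retmax}
    # urlencode percent-escapes brackets, quotes and spaces in the term.
    return ESEARCH + "?" + urlencode(params)


# Same journal and date filter as in the URL above.
term = '"Bull. Am. Mus. Nat. Hist."[Journal] AND 2005[Publication Date]'
url = esearch_url(term)
print(url)
```

Increasing `retstart` in steps of `retmax` would page through a larger result set.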

Thanks, Lars. I wasn't clear in my original post... I'm actually looking for datasets with links to ANY AND ALL journal articles, so I don't think this approach will get me what I need. I tried adding NOT "Unpublished"[Journal], but that doesn't work (and isn't ideal anyway).

ADD REPLY • link 14.2 years ago by Heather Piwowar ▴ 380

Ram · Answer 2 · 2011-02-16

14.2 years ago

Maximilian Haeussler ★ 1.7k

If you filter on journal you will get ALL articles linked to a sequence. This will mostly include articles that were manually annotated as related to the sequence in question. E.g. if a GenBank record contains the sequence of TNF, the GenBank or Medline annotators (?) will add the paper to the GenBank record (at least they did so for some time).

So, to get the real answer, you will need to download and filter the records with your own scripts. Downloading is easy, as you probably don't want the high-throughput section. Filtering is easy, as all Bio* scripting language libraries contain parsers for Genbank files. Then you simply look at the LAST reference of the Genbank record, as explained here: http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html#SubmitterBlockB
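As a rough illustration of the offline filtering step, here is a minimal sketch in plain Python. It assumes the standard GenBank flat-file layout (top-level keywords in column 1, values starting at column 13) and hand-rolls the parsing so that it runs without extra libraries; a Bio* parser would be the more robust choice in practice. The sample records, accessions and PMID are made up, and the "published" test (a `(year)` in the JOURNAL line, no Direct Submission title) is an assumption based on the question, not a vetted rule.

```python
SAMPLE = """\
LOCUS       EXAMPLE01                500 bp    DNA     linear   VRT 01-JAN-2006
ACCESSION   EXAMPLE01
REFERENCE   1  (bases 1 to 500)
  AUTHORS   Doe,J. and Roe,R.
  TITLE     A made-up article title that wraps
            onto a second line
  JOURNAL   J. Example Biol. 10 (2), 100-110 (2005)
   PUBMED   12345678
REFERENCE   2  (bases 1 to 500)
  AUTHORS   Doe,J.
  TITLE     Direct Submission
  JOURNAL   Submitted (01-DEC-2004) Example Institute
FEATURES             Location/Qualifiers
//
LOCUS       EXAMPLE02                300 bp    DNA     linear   BCT 01-JAN-2006
ACCESSION   EXAMPLE02
REFERENCE   1  (bases 1 to 300)
  AUTHORS   Poe,E.
  TITLE     Direct Submission
  JOURNAL   Submitted (10-JAN-2005) Example Institute
//
"""


def iter_references(genbank_text):
    """Yield (accession, reference) pairs; each reference is a dict of subfields."""
    accession, ref, last_key = None, None, None
    for line in genbank_text.splitlines():
        if not line.strip():
            continue
        if line.startswith("REFERENCE"):
            if ref is not None:
                yield accession, ref
            ref, last_key = {}, None
        elif not line.startswith(" "):
            # Any other top-level keyword (or "//") ends the reference block.
            if ref is not None:
                yield accession, ref
            ref, last_key = None, None
            if line.startswith("ACCESSION"):
                accession = line.split()[1]
        elif ref is not None:
            key, value = line[:12].strip(), line[12:].strip()
            if key:
                ref[key], last_key = value, key
            elif last_key:
                # Continuation line: append to the previous subfield.
                ref[last_key] += " " + value
    if ref is not None:
        yield accession, ref


def published_in(ref, year):
    """Assumed test: a journal citation for `year`, not a direct submission."""
    title = ref.get("TITLE", "")
    journal = ref.get("JOURNAL", "")
    if title == "Direct Submission" or journal.startswith(("Submitted", "Unpublished")):
        return False
    return f"({year})" in journal


def article_hits(genbank_text, year):
    """Return (accession, PMID or None) for every matching reference."""
    return [(acc, ref.get("PUBMED"))
            for acc, ref in iter_references(genbank_text)
            if published_in(ref, year)]


print(article_hits(SAMPLE, 2005))
```

Note that the second sample record was submitted in 2005 but is skipped, because its only reference is a Direct Submission.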

I have some old scripts that did this type of thing, if you need a starting point.

Thanks, Maximilian, your first paragraph is a very helpful clarification and consistent with my experience. Yes, pointers to some starting scripts would be great, if that is easy?

ADD REPLY • link 14.2 years ago by Heather Piwowar ▴ 380

Hi Heather, try this script: http://genomewiki.ucsc.edu/images/a/a4/GenbankToTables.txt

You need a reasonably recent Python version (>2.5, released in 2006) and Biopython installed (sorry, but GenBank parsing always requires some libraries; are you using Linux?). Rename the file from .txt to .py.

Then run it like this: genbankToTables [?] [?] --bigTable --onlySubmitter

ADD REPLY • link 14.2 years ago by Maximilian Haeussler ★ 1.7k

Hi Heather, try this script: http://genomewiki.ucsc.edu/index.php/Image:GenbankToTables.txt (download the file; do not copy & paste). You need a reasonably recent Python version (>2.5, released in 2006) and Biopython installed (sorry, but GenBank parsing always requires some libraries; are you using Linux?).

Rename the file from .txt to .py. Then run it like this: python genbankToTables.py [?] [?] --bigTable --onlySubmitter -

ADD REPLY • link updated 5.6 years ago by Ram 45k • written 14.2 years ago by Maximilian Haeussler ★ 1.7k

Ram · Answer 3 · 2011-02-16

I think there are 2 approaches to this problem.

Start by obtaining a list of PMIDs for articles published in 2005. You can do this using either EUtils or the PubMed website, by searching for 2005[DP]. There are 693 394 entries. If using the website, select "Send to File" and choose "PMID List" as the output.
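If the PMIDs are fetched with EUtils rather than the website, the ESearch XML response has to be read and paged through. The sketch below assumes the usual `eSearchResult` layout with `Count` and `IdList/Id` elements; the sample response is hand-written (the three PMIDs are placeholders, the count is the figure quoted above), and both helper names are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hand-written stand-in for an ESearch response to term=2005[DP], db=pubmed.
SAMPLE_RESPONSE = """\
<eSearchResult>
  <Count>693394</Count>
  <RetMax>3</RetMax>
  <RetStart>0</RetStart>
  <IdList>
    <Id>11111111</Id>
    <Id>22222222</Id>
    <Id>33333333</Id>
  </IdList>
</eSearchResult>
"""


def parse_esearch(xml_text):
    """Return (total hit count, list of IDs on this page)."""
    root = ET.fromstring(xml_text)
    count = int(root.findtext("Count"))
    ids = [node.text for node in root.findall("./IdList/Id")]
    return count, ids


def page_offsets(count, retmax):
    """retstart values needed to walk through `count` hits, `retmax` at a time."""
    return list(range(0, count, retmax))


count, pmids = parse_esearch(SAMPLE_RESPONSE)
print(count, pmids)
print(len(page_offsets(count, 10000)), "requests at retmax=10000")
```

Each offset from `page_offsets` would be passed as `retstart` in a separate request; the page size here is arbitrary.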

Then either:

Use the EUtils elink query

Here is the ELink documentation. There are examples at the end. Basically, you would obtain the list of 2005 PMIDs as described above, then construct a URL to link PubMed and Nucleotide databases, along the lines of:
```
dbfrom=pubmed&db=nuccore&id=PMID
```
This will retrieve UIDs for query/retrieval of the nuccore database.
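
With several hundred thousand PMIDs, the ELink requests would need to be batched. A minimal sketch of building such URLs follows; the helper name `elink_urls` and the batch size are assumptions, and no request is actually sent. Each PMID is passed as its own `id` parameter, which, as far as I know, keeps the returned links separate per PMID rather than merged into one set.

```python
from urllib.parse import urlencode

ELINK = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi"


def elink_urls(pmids, batch_size=200):
    """Build one ELink URL per batch of PMIDs (hypothetical helper)."""
    urls = []
    for start in range(0, len(pmids), batch_size):
        batch = pmids[start:start + batch_size]
        params = [("dbfrom", "pubmed"), ("db", "nuccore")]
        # One id parameter per PMID in this batch.
        params += [("id", str(pmid)) for pmid in batch]
        urls.append(ELINK + "?" + urlencode(params))
    return urls


# Placeholder PMIDs, only to show the shape of the output.
for url in elink_urls([11111111, 22222222, 33333333], batch_size=2):
    print(url)
```
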
Using the NCBI FTP site

Go to a very useful directory in the NCBI FTP site named /entrez/links. It contains compressed text files which link UIDs between pairs of Entrez databases. You'll find, for example, gene_pubmed.lnk.gz, nucleotide_pubmed.lnk.gz and genome_pubmed.lnk.gz. Now, all you have to do is parse the appropriate list to find the 2005 PMIDs, extract the corresponding sequence UIDs and use them to query the appropriate sequence database.

UPDATE: I just noticed that the file dates in /entrez/links are rather old; in fact, they are dated 04 which is not much use for 2005! You may have to dig around the FTP site for more recent files - I'll see if I can find them.

score 0 · Answer 4 · 2013-07-13

11.8 years ago

cdsouthan ★ 1.9k

SRS used to be able to do this, but a complete GenBank is no longer indexed. RefSeq still is, though, so ([refseqgenrel-Year#2005:2005] > parent ) returns 305071 entries.
