Find All Genbank Submissions Associated With An Article Publication In 2005
13.8 years ago

Is there an easy way to query Genbank for all sequence submissions associated with ANY journal article published in 2005? So, all sequences that have a JOURNAL metadata line that includes "2005" (and ideally a real journal name and/or PMID) but without a related TITLE of Direct Submission. I'm interested in datasets associated with ANY AND ALL 2005 publications.

Using a filter that includes "2005"[pdat] will limit the responses to sequences that were deposited in 2005, but this isn't quite what I want.

An example of what I want to find: http://www.ncbi.nlm.nih.gov/nuccore/AY843753.1

And what I don't want to find: http://www.ncbi.nlm.nih.gov/nuccore/CP000223.3, http://www.ncbi.nlm.nih.gov/nuccore/CH478173.1, http://www.ncbi.nlm.nih.gov/nuccore/AY944235.1

Can metadata in the REFERENCES section be filtered using the NCBI web interface? Or any other web interface? Or can I do some creative PMID linking and filtering thing? Or am I best off downloading the metadata (using eutils?) and filtering it with scripts offline?

Thanks!

genbank retrieval eutils • 5.5k views
13.8 years ago

As far as I can see, you can do this by querying the [Journal] and [Publication Date] fields of the NCBI Nucleotide database. You can do this via the Entrez web interface or through eutils esearch (and then fetch the actual entries with efetch, using the returned primary IDs):

http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=nucleotide&term=(((Bull.%20Am.%20Mus.%20Nat.%20Hist.[Journal])%20AND%202005[Publication%20Date]))
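
A minimal Python sketch of building that query, using only the standard library. The helper names `esearch_url` and `efetch_url` and the `retmax`/`retstart` defaults are illustrative choices, not anything from the thread; `rettype=gb` with `retmode=text` is assumed to give GenBank flat files. The functions only construct URLs, so nothing is sent to NCBI until you open them.

```python
from urllib.parse import urlencode

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(journal, year, retmax=500, retstart=0):
    """Build an esearch URL for nucleotide records citing `journal` in `year`."""
    term = f"({journal}[Journal]) AND {year}[Publication Date]"
    params = {
        "db": "nucleotide",
        "term": term,
        "retmax": retmax,      # page size (illustrative default)
        "retstart": retstart,  # offset of the first UID to return
    }
    return f"{EUTILS}/esearch.fcgi?{urlencode(params)}"


def efetch_url(ids, rettype="gb"):
    """Build an efetch URL that retrieves the records for the given UIDs."""
    params = {
        "db": "nucleotide",
        "id": ",".join(ids),
        "rettype": rettype,
        "retmode": "text",
    }
    return f"{EUTILS}/efetch.fcgi?{urlencode(params)}"


if __name__ == "__main__":
    print(esearch_url("Bull. Am. Mus. Nat. Hist.", 2005))
```

Paging through a large result set would mean calling `esearch_url` repeatedly with an increasing `retstart`.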


Thanks, Lars. I wasn't clear in my original post... I'm actually looking for datasets with links to ANY AND ALL journal articles, so I don't think this approach will get me what I need. I tried adding NOT "Unpublished"[Journal], but that doesn't work (and isn't ideal anyway).

13.8 years ago

If you filter on journal you will get ALL articles linked to a sequence. These will mostly be articles that were manually annotated as related to the sequence in question. E.g. if a GenBank record contains the sequence of TNF, the GenBank or Medline annotators (?) will add the paper to the GenBank record (at least they did so for some time).

So, to get the real answer, you will need to download and filter the records with your own scripts. Downloading is easy, as you probably don't want the high-throughput section. Filtering is easy, as all Bio* scripting language libraries contain parsers for Genbank files. Then you simply look at the LAST reference of the Genbank record, as explained here: http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html#SubmitterBlockB

I have some old scripts that did this type of thing, if you need a starting point.
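
To give an idea of the offline filtering step, here is a rough standard-library sketch (it is not the script mentioned above, and a Biopython parser would be more robust). It assumes the usual flat-file layout where sub-keywords such as TITLE and JOURNAL sit in the first 12 columns and continuation lines are indented. Rather than looking only at the last reference, it checks every reference and skips the "Direct Submission" block, which is my reading of what the question asks for. The function names and the sample record are invented for illustration.

```python
import re


def parse_references(record_text):
    """Return one dict per REFERENCE block, keyed by sub-keyword (TITLE, JOURNAL, ...)."""
    refs, current, field = [], None, None
    for line in record_text.splitlines():
        if not line.strip():
            continue
        if line.startswith("REFERENCE"):
            current, field = {}, None
            refs.append(current)
        elif not line.startswith(" "):
            current = None  # any other top-level keyword ends the reference block
        elif current is not None:
            key, value = line[:12].strip(), line[12:].strip()
            if key:
                field = key
                current[field] = value
            elif field:
                current[field] += " " + value  # continuation line
    return refs


def published_in(refs, year):
    """Keep references that cite a journal article dated `year`."""
    hits = []
    for ref in refs:
        journal = ref.get("JOURNAL", "")
        if ref.get("TITLE") == "Direct Submission":
            continue
        if journal.startswith(("Submitted", "Unpublished")):
            continue
        if re.search(rf"\({year}\)", journal):
            hits.append(ref)
    return hits


# Invented record fragment, for illustration only.
SAMPLE = """\
LOCUS       XX000001
REFERENCE   1  (bases 1 to 100)
  AUTHORS   Doe,J.
  TITLE     An invented article title that wraps
            onto a second line
  JOURNAL   J. Made Up 1 (2), 3-4 (2005)
   PUBMED   00000000
REFERENCE   2  (bases 1 to 100)
  AUTHORS   Doe,J.
  TITLE     Direct Submission
  JOURNAL   Submitted (01-JAN-2005) Some Lab
FEATURES             Location/Qualifiers
"""

if __name__ == "__main__":
    for ref in published_in(parse_references(SAMPLE), 2005):
        print(ref["JOURNAL"], ref.get("PUBMED"))
```

A record would then count as "associated with a 2005 publication" if `published_in` returns at least one reference for it.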


Thanks, Maximilian, your first paragraph is a very helpful clarification and consistent with my experience. Yes, pointers to some starting scripts would be great, if that is easy?


Hi Heather, try this script: http://genomewiki.ucsc.edu/images/a/a4/GenbankToTables.txt

You need a not-too-old Python version (>2.5, released in 2006) and Biopython installed (sorry, but GenBank parsing always requires some libraries. Are you using Linux?). Rename the file extension from .txt to .py.

Then run it like this: genbankToTables [?] [?] --bigTable --onlySubmitter


Hi Heather, try this script: http://genomewiki.ucsc.edu/index.php/Image:GenbankToTables.txt (download the file, do not copy & paste). You need a not-too-old Python version (>2.5, released in 2006) and Biopython installed (sorry, but GenBank parsing always requires some libraries. Are you using Linux?).

Rename the file extension from .txt to .py. Then run it like this: python genbankToTables.py [?] [?] --bigTable --onlySubmitter -

13.8 years ago
Neilfws 49k

I think there are 2 approaches to this problem.

Start by obtaining a list of PMIDs for articles published in 2005. You can do this using either EUtils or the PubMed website, by searching for 2005[DP]. There are 693 394 entries. If using the website, select "Send to File" and choose "PMID List" as the output.

Then either:

  1. Use the EUtils ELink query

    Here is the ELink documentation. There are examples at the end. Basically, you would obtain the list of 2005 PMIDs as described above, then construct a URL to link the PubMed and Nucleotide databases, along the lines of:

    dbfrom=pubmed&db=nuccore&id=PMID

    This will retrieve UIDs for query/retrieval of the nuccore database.

  2. Use the NCBI FTP site

    Go to a very useful directory in the NCBI FTP site named /entrez/links. It contains compressed text files which link UIDs between pairs of Entrez databases. You'll find, for example, gene_pubmed.lnk.gz, nucleotide_pubmed.lnk.gz and genome_pubmed.lnk.gz. Now, all you have to do is parse the appropriate list to find the 2005 PMIDs, extract the corresponding sequence UIDs and use them to query the appropriate sequence database.
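
For approach 1, the ELink URLs could be generated along these lines. This is only a sketch: the function name `elink_urls` and the batch size of 100 are arbitrary choices of mine. It passes one `id=` parameter per PMID, which, as far as I understand the ELink documentation, makes ELink report the links for each PMID separately instead of pooling them.

```python
from urllib.parse import urlencode

ELINK = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi"


def elink_urls(pmids, batch_size=100):
    """Yield ELink URLs (pubmed -> nuccore), one URL per batch of PMIDs."""
    for start in range(0, len(pmids), batch_size):
        batch = pmids[start:start + batch_size]
        params = [("dbfrom", "pubmed"), ("db", "nuccore")]
        # One id= parameter per PMID so that results stay separated by PMID.
        params += [("id", pmid) for pmid in batch]
        yield f"{ELINK}?{urlencode(params)}"


if __name__ == "__main__":
    # Placeholder IDs, not real 2005 PMIDs.
    for url in elink_urls(["1", "2", "3"], batch_size=2):
        print(url)
```

With several hundred thousand PMIDs this means thousands of requests, so you would want to pause between calls and stay within NCBI's usage guidelines.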

UPDATE: I just noticed that the file dates in /entrez/links are rather old; in fact, they are dated 2004, which is not much use for 2005! You may have to dig around the FTP site for more recent files - I'll see if I can find them.


Neil, as a modification of #1 above, would it work if I started with the Nucleotide database ID and then linked through to its PubMed IDs using the following?

dbfrom=nuccore&db=pubmed&id=NUCCOREID

Then I could eliminate from my list all NUCCOREIDs that don't lead to PMIDs with a 2005 publication date. This is ideally what I'd like to do, but I'm afraid I'll encounter the "gets all related publications, not just those in refs section" issue that Maximilian describes.

I'll experiment and be back.


Yes, you could go that way. However, you then have the problem of filtering for 2005 PMIDs, which is avoided if you start with 2005 PMIDs and go the other way. I see your problem with "all related versus all cited" - I'm not sure which are returned by ELink.
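
If you do go from nuccore to PubMed, one way around the date filtering is to intersect the linked PMIDs with the 2005 PMID list from the 2005[DP] search. A rough sketch follows, assuming the ELink XML reply has the usual LinkSet / IdList / LinkSetDb / Link layout and that each nuccore UID was sent as its own `id=` parameter (so each LinkSet has a single source ID). The function names are made up, and `nuccore_pubmed` is my assumption for the link name. It does not settle the "related versus cited" question; it only filters whatever ELink returns.

```python
import xml.etree.ElementTree as ET


def links_by_source(elink_xml, linkname=None):
    """Map each source UID in an ELink XML reply to its list of linked UIDs."""
    result = {}
    root = ET.fromstring(elink_xml)
    for linkset in root.iter("LinkSet"):
        sources = [e.text for e in linkset.findall("./IdList/Id")]
        targets = []
        for lsdb in linkset.findall("LinkSetDb"):
            # Optionally keep only one kind of link, e.g. "nuccore_pubmed".
            if linkname is None or lsdb.findtext("LinkName") == linkname:
                targets += [e.text for e in lsdb.findall("./Link/Id")]
        for src in sources:  # assumes one source UID per LinkSet
            result.setdefault(src, []).extend(targets)
    return result


def with_paper_in(links, pmids_of_year):
    """Return the source UIDs that link to at least one PMID in the given set."""
    return sorted(src for src, pmids in links.items()
                  if any(p in pmids_of_year for p in pmids))


# Hand-written reply with placeholder IDs, for illustration only.
EXAMPLE_XML = (
    "<eLinkResult>"
    "<LinkSet><DbFrom>nuccore</DbFrom><IdList><Id>111</Id></IdList>"
    "<LinkSetDb><DbTo>pubmed</DbTo><LinkName>nuccore_pubmed</LinkName>"
    "<Link><Id>900</Id></Link><Link><Id>901</Id></Link></LinkSetDb></LinkSet>"
    "<LinkSet><DbFrom>nuccore</DbFrom><IdList><Id>222</Id></IdList></LinkSet>"
    "</eLinkResult>"
)

if __name__ == "__main__":
    links = links_by_source(EXAMPLE_XML, linkname="nuccore_pubmed")
    print(with_paper_in(links, {"901"}))
```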

11.4 years ago
cdsouthan ★ 1.9k

SRS used to be able to do this, but the complete GenBank is no longer indexed there. RefSeq still is, though, so the query ([refseqgenrel-Year#2005:2005] > parent ) returns 305071 entries.
