Is there an easy way to query Genbank for all sequence submissions associated with ANY journal article published in 2005? So, all sequences that have a JOURNAL metadata line that includes "2005" (and ideally a real journal name and/or PMID) but without a related TITLE of Direct Submission. I'm interested in datasets associated with ANY AND ALL 2005 publications.
Using a filter that includes "2005"[pdat] will limit the responses to sequences that were deposited in 2005, but this isn't quite what I want.
Can metadata in the REFERENCES section be filtered using the NCBI web interface? Or any other web interface? Or can I do some creative PMID linking and filtering thing? Or am I best off downloading the metadata (using eutils?) and filtering it with scripts offline?
As far as I can see, you can do this by querying the [Journal] and [Publication date] fields of the NCBI Nucleotide database. You can do this via the Entrez web interface or through eutils esearch (and then subsequently fetch the actual entries based on their primary IDs).
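A minimal sketch of what such an eutils query could look like, using only the Python standard library. The field tags are the ones named above; the journal name "Nature" and the helper names `esearch_url` and `efetch_url` are placeholders for illustration, not anything from the thread. The code only builds the URLs, so fetching and parsing the responses is left out.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities.
EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

def esearch_url(term, db="nuccore", retmax=20):
    """Build an ESearch URL that returns primary IDs matching `term`."""
    params = {"db": db, "term": term, "retmax": retmax}
    return f"{EUTILS}/esearch.fcgi?{urlencode(params)}"

def efetch_url(ids, db="nuccore", rettype="gb", retmode="text"):
    """Build an EFetch URL that returns GenBank flat files for `ids`."""
    params = {"db": db, "id": ",".join(ids), "rettype": rettype, "retmode": retmode}
    return f"{EUTILS}/efetch.fcgi?{urlencode(params)}"

# "Nature" is only an example journal name (an assumption).
term = '"Nature"[Journal] AND 2005[Publication Date]'
print(esearch_url(term))
```

The IDs returned by the first URL would then be passed to `efetch_url` in batches to download the records themselves.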
Thanks, Lars. I wasn't clear in my original post... I'm actually looking for datasets with links to ANY AND ALL journal articles, so I don't think this approach will get me what I need. I tried adding NOT "Unpublished"[Journal], but that doesn't work (and isn't ideal anyway).
If you filter on journal you will get ALL articles linked to a sequence. This will mostly include articles that have been manually annotated as related to the sequence in question. E.g. if a GenBank record contains the sequence of TNF, the GenBank or Medline annotators (?) will add the paper to the GenBank record (at least they did so for some time).
So, to get the real answer, you will need to download and filter the records with your own scripts. Downloading is easy, as you probably don't want the high-throughput section. Filtering is easy, as all Bio* scripting language libraries contain parsers for Genbank files. Then you simply look at the LAST reference of the Genbank record, as explained here: http://www.ncbi.nlm.nih.gov/Sitemap/samplerecord.html#SubmitterBlockB
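As a rough illustration of the offline filtering step, here is a dependency-free sketch that pulls the REFERENCE blocks out of one GenBank flat-file record and applies the criteria from the original question (a JOURNAL line ending in a given year, and not a Direct Submission or Unpublished entry). It assumes the standard flat-file layout with sub-keywords in the first 12 columns. The function names are made up for this example, and it checks every reference rather than only the last one. A real parser such as Biopython's will be more robust.

```python
import re

# Sub-keywords that can appear inside a REFERENCE block.
SUBKEYS = ("AUTHORS", "CONSRTM", "TITLE", "JOURNAL", "MEDLINE", "PUBMED", "REMARK")

def parse_references(record_text):
    """Return one dict per REFERENCE block of a GenBank flat-file record."""
    refs, current, key = [], None, None
    for line in record_text.splitlines():
        if line.startswith("REFERENCE"):
            current, key = {}, None
            refs.append(current)
        elif current is not None and line[:1] == " ":
            head, body = line[:12].strip(), line[12:].strip()
            if head:
                # New sub-keyword; unknown ones are skipped.
                key = head if head in SUBKEYS else None
                if key:
                    current[key] = body
            elif key:
                # Continuation line of the previous sub-keyword.
                current[key] += " " + body
        else:
            # Any other top-level keyword ends the reference section.
            current = None
    return refs

def is_paper_from_year(ref, year):
    """True if the reference looks like a journal article from `year`."""
    title = ref.get("TITLE", "")
    journal = ref.get("JOURNAL", "")
    if title == "Direct Submission" or journal.startswith(("Unpublished", "Submitted")):
        return False
    years = re.findall(r"\((\d{4})\)", journal)
    return bool(years) and years[-1] == year

def record_has_paper_from_year(record_text, year="2005"):
    """True if any reference in the record is a journal article from `year`."""
    return any(is_paper_from_year(r, year) for r in parse_references(record_text))
```

Calling `record_has_paper_from_year(text)` on each downloaded record and keeping the ones that return True gives the filtered list.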
I have some old scripts that did this type of thing, if you need a starting point.
Thanks, Maximilian, your first paragraph is a very helpful clarification and consistent with my experience. Yes, pointers to some starting scripts would be great, if that is easy?
Hi Heather, try this script: http://genomewiki.ucsc.edu/index.php/Image:GenbankToTables.txt (download the file; do not copy and paste). You need a not-too-old Python version (>2.5, published in 2006) with Biopython installed (sorry, but GenBank parsing always requires some libraries. Are you using Linux?).
Rename the file from .txt to .py. Then run it like this: python genbankToTables.py [?] [?] --bigTable --onlySubmitter
Start by obtaining a list of PMIDs for articles published in 2005. You can do this using either EUtils or the PubMed website, by searching for 2005[DP]. There are 693 394 entries. If using the website, select "Send to File" and choose "PMID List" as the output.
Then either:
1. Use the EUtils ELink query
Here is the ELink documentation. There are examples at the end. Basically, you would obtain the list of 2005 PMIDs as described above, then construct a URL to link PubMed and Nucleotide databases, along the lines of:
dbfrom=pubmed&db=nuccore&id=PMID
This will retrieve UIDs for query/retrieval of the nuccore database.
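A small sketch of how that ELink request might be built and its result read in Python. Passing each PMID as its own `id` parameter keeps the links separate per article. The XML element names in `parse_links` (`LinkSet`, `IdList/Id`, `LinkSetDb/Link/Id`) reflect my understanding of the usual ELink output and should be checked against a real response. The function names are invented for this example.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

ELINK = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi"

def elink_url(pmids, dbfrom="pubmed", db="nuccore"):
    """Build an ELink URL with one `id` parameter per PMID."""
    params = {"dbfrom": dbfrom, "db": db, "id": list(pmids)}
    return ELINK + "?" + urlencode(params, doseq=True)

def parse_links(xml_text):
    """Map each source ID to the list of linked target IDs.

    Assumes one <LinkSet> per source ID, as returned when every ID is
    sent as a separate `id` parameter.
    """
    links = {}
    for linkset in ET.fromstring(xml_text).iter("LinkSet"):
        source = linkset.findtext("IdList/Id")
        links[source] = [e.text for e in linkset.findall("LinkSetDb/Link/Id")]
    return links
```

With several hundred thousand PMIDs the requests would need to be sent in batches and rate-limited, which this sketch does not attempt.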
2. Use the NCBI FTP site
Go to a very useful directory in the NCBI FTP site named /entrez/links. It contains compressed text files which link UIDs between pairs of Entrez databases. You'll find, for example, gene_pubmed.lnk.gz, nucleotide_pubmed.lnk.gz and genome_pubmed.lnk.gz.
Now, all you have to do is parse the appropriate list to find the 2005 PMIDs, extract the corresponding sequence UIDs and use them to query the appropriate sequence database.
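The parsing step could look roughly like this. The layout of the .lnk files is not described here, so the sketch assumes whitespace-separated columns with the sequence UID in the first column and the PMID in the second. Both column positions are parameters so they can be changed once the real layout is known. The function names are hypothetical.

```python
import gzip

def sequence_uids_for_pmids(lines, pmids, seq_col=0, pmid_col=1):
    """Collect sequence UIDs from link-file lines whose PMID is in `pmids`.

    Column positions are assumptions; adjust them to the actual file layout.
    """
    hits = set()
    for line in lines:
        fields = line.split()
        if len(fields) > max(seq_col, pmid_col) and fields[pmid_col] in pmids:
            hits.add(fields[seq_col])
    return hits

def scan_link_file(path, pmids):
    """Stream a gzip-compressed link file without unpacking it to disk."""
    with gzip.open(path, "rt") as handle:
        return sequence_uids_for_pmids(handle, pmids)
```

Here `pmids` would be the set of 2005 PMIDs loaded from the "PMID List" file, and a call such as `scan_link_file("nucleotide_pubmed.lnk.gz", pmids)` would return the matching sequence UIDs.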
UPDATE: I just noticed that the file dates in /entrez/links are rather old; in fact, they are dated 2004, which is not much use for 2005! You may have to dig around the FTP site for more recent files - I'll see if I can find them.
Neil, as a modification of #1 above, would it work if I started with the Nucleotide database ID and then link through to its PubMed IDs using
dbfrom=nuccore&db=pubmed&id=NUCCOREID
? Then I could eliminate from my list all NUCCOREIDs that don't lead to PMIDs with a 2005 publication date.
This is ideally what I'd like to do, but I'm afraid I'll encounter the "gets all related publications, not just those in refs section" issue that Maximilian describes.
Yes, you could go that way. However, you then have the problem of filtering for 2005 PMIDs, which is avoided if you start with 2005 PMIDs and go the other way. I see your problem with "all related versus all cited" - I'm not sure which are returned by ELink.
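If you do go from nuccore to PubMed, one way to sidestep the date lookup is to reuse the list of 2005 PMIDs from the PubMed search and simply test membership. A minimal sketch, assuming you already have a mapping from each NUCCOREID to its linked PMIDs (the function and variable names are invented for this example):

```python
def keep_linked_to_2005(nuccore_to_pmids, pmids_2005):
    """Return the nuccore IDs that link to at least one 2005 PMID."""
    pmids_2005 = set(pmids_2005)
    return {
        uid
        for uid, pmids in nuccore_to_pmids.items()
        if pmids_2005.intersection(pmids)
    }
```

This only filters on whatever links ELink returns, so it cannot by itself tell cited papers apart from merely related ones.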
SRS used to be able to do this, but a complete GenBank is no longer indexed. RefSeq still is, though, so ([refseqgenrel-Year#2005:2005] > parent ) returns 305071 entries.