Hi Dominick,
One thing I am not sure of is, how is it that you got the other PMID in the first place?
The reason I ask is that you mentioned the Bio package. Bio is a very lovely wrapper (written by the founder of Biostars) around the eutils published by the NCBI. Why say this? Well, because using either eutils or Bio to generate the PMIDs, rather than to pull them, could circumvent the problem you describe for exactly the reason GenoMax has indicated above.
But wouldn't this leave you with a second problem: that you would now have PMIDs unlinked to the information you want (i.e. their affiliation with the NIH grant awards program)? Not necessarily ...
Pending further knowledge about this award program (which you have but I don't), I might do something like this:
1. Fetch a superset that includes all PMIDs from the last 20 years using esearch (this will give you the IDs only).
2. Using the IDs from 1., generate a detailed summary for all records obtained using esummary.
3. Pay careful attention to all the available fields in the esummary output, in particular to fields associated with grant numbers (which seems to be what you're after).
4. Link PMIDs to the grant award of interest directly, without dealing with deprecated PMIDs at all.
For 2. and 3. (the fields of interest part specifically), note the commented line of Python code:
from Bio import Entrez

Entrez.email = "you@example.com"  # placeholder address; NCBI asks for a contact email
# search_query is assumed to be defined already (your PubMed query string)

# Perform the PubMed search
handle = Entrez.esearch(db="pubmed", term=search_query, retmax=1000)
record = Entrez.read(handle)
handle.close()

# Fetch details for each article
articles = []
for pmid in record["IdList"]:
    article_handle = Entrez.esummary(db="pubmed", id=pmid)
    article_record = Entrez.read(article_handle)[0]
    article_handle.close()
    article = {
        "Title": article_record["Title"],
        "Authors": ", ".join(article_record["AuthorList"]),
        "Journal": article_record["FullJournalName"],
        "DOI": article_record.get("DOI", ""),
        "PMID": pmid,
        ###### It is possible this field or another field will contain this
        ###### information directly, depending on your award info. Inspect
        ###### article_record.keys() to see what is actually returned.
        "Grant Numbers": ", ".join(article_record.get("ArticleIds", {}).get("GrantList", [])),
    }
    articles.append(article)
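If the esummary fields turn out not to carry the grant information, one fallback I would try is efetch with XML output, since PubMed XML records can include a GrantList element with one GrantID per grant. Below is a minimal sketch of that idea, calling eutils directly with the standard library rather than going through Bio. The function names parse_grants and fetch_grants are my own and hypothetical, and I have not checked how complete the GrantList is for your particular award program.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def parse_grants(xml_text):
    """Map each PMID in a PubMed efetch XML document to its list of GrantIDs."""
    root = ET.fromstring(xml_text)
    grants = {}
    for article in root.iter("PubmedArticle"):
        pmid = article.findtext("MedlineCitation/PMID")
        # Some Grant entries list only an agency, so skip missing GrantIDs.
        ids = [grant.findtext("GrantID") for grant in article.iter("Grant")]
        grants[pmid] = [grant_id for grant_id in ids if grant_id]
    return grants


def fetch_grants(pmids):
    """Fetch PubMed XML for a batch of PMIDs (network call) and parse the grants."""
    params = urllib.parse.urlencode(
        {"db": "pubmed", "id": ",".join(pmids), "retmode": "xml"}
    )
    with urllib.request.urlopen(f"{EFETCH_URL}?{params}") as response:
        return parse_grants(response.read().decode("utf-8"))
```

Calling fetch_grants with a modest batch of PMIDs per request (rather than one request per PMID) should also cut down the number of round trips considerably.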
A complete workflow that will make very quick work of this query is implemented here: https://github.com/LauferVA/EntrezMetadataTools/blob/main/libs/query_api.py. This workflow will download, annotate and reformat the metadata associated with all 10M seq records in SRA in just over 2 hours on an ordinary laptop, so 120k should not be an issue even without an API key. In the link provided, you could modify line 148 to point at pubmed rather than sra, then modify line 149 to reflect your query terms.
Other solutions could be imagined, too (e.g. an HTML parser for the exact string associated with redirection). Happy querying.
VL
how is this query itself originating?
I have a list of PMIDs for publications linked to a list of grants, exported from NIH RePORTER.
Yep. In this case I'd definitely start with the grant numbers themselves, as others have indicated. I did not recommend this before due to uncertainty regarding the type of award being described, as well as the phrasing of the original post.
But that's the better approach in this case.
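For what it's worth, here is a minimal sketch of what starting from the grant numbers could look like: build one esearch query from the grant list using PubMed's [gr] (grant number) field tag, then ask for the matching PMIDs. The names build_grant_query and search_pmids are hypothetical, and the sketch assumes the grant numbers exported from RePORTER are formatted the way PubMed indexes them, which may need some normalizing in practice.

```python
import json
import urllib.parse
import urllib.request

ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_grant_query(grant_numbers):
    """Join grant numbers into one PubMed query using the [gr] field tag."""
    return " OR ".join(f"{grant}[gr]" for grant in grant_numbers)


def search_pmids(grant_numbers, retmax=1000):
    """Run esearch (network call) and return the PMIDs matching any of the grants."""
    params = urllib.parse.urlencode(
        {
            "db": "pubmed",
            "term": build_grant_query(grant_numbers),
            "retmax": retmax,
            "retmode": "json",
        }
    )
    with urllib.request.urlopen(f"{ESEARCH_URL}?{params}") as response:
        data = json.load(response)
    return data["esearchresult"]["idlist"]
```

Because the PMIDs come straight from a current search, they are already linked to the grants and no deprecated IDs need to be resolved. For a long grant list, splitting it into several smaller queries keeps each request URL to a reasonable length.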