Hi all. I'm trying to format the results of a search using Entrez Direct (the E-utilities on the UNIX command line). What I want to do is search using esearch and then use xtract to output the following columns:
pubmedid | First Author Last Name / First Author first initial | First Author Affiliation | Date Published
I can get some of the output I need using two different commands, but putting them together looks trickier than it should be, or so I imagine. The problem, I think, lies with the two different formats: docsum and xml.
The first command I've been playing with:
./esearch \
-db pubmed \
-query "search string" | \
./efilter -mindate 2005 | \
./efetch -format docsum | \
./xtract \
-pattern DocumentSummary \
-element MedlineCitation/PMID \
-element Id SortFirstAuthor | \
sort -t $'\t' -k 3,3n -k 2,2f
So that outputs the first two columns as desired. However, the docsum format doesn't contain information about affiliation.
Using this command:
./esearch \
-db pubmed \
-query "search string" | \
./efilter -mindate 2005 | \
./efetch -format xml | \
./xtract -pattern PubmedArticle \
-element MedlineCitation/PMID \
-element Id SortFirstAuthor Affiliation \
-block PubDate \
-sep " " \
-element Year,Month MedlineDate | \
sort -t $'\t' -k 3,3n -k 2,2f
I get the pubmedid, and all affiliations of every author on each publication.
Does anyone know how I might tweak either of these commands? Is it possible to ignore all authors except the first author?
All help is appreciated.
Can efetch be used to search for all publications matching a string, similar to my example where esearch is used?
I didn't read your question carefully; I thought you needed author information for a specific PubMed ID. EFetch works when you already have a PubMed ID, and it doesn't work with all the Entrez databases. You will have to write an ESearch -> EFetch pipeline as explained here, or you can use the Biopython module, which can run ESearch to return a list of PMIDs and then fetch the author information using EFetch. I am sure Biopython has a module for ESearch.
Go through this tutorial. You can see how it uses a string to search PubMed for all the PMIDs that match that string using "esearch". Once you have the PMIDs, you can use "efetch".
OK, so I will modify the first command to print only the PubMed IDs. From there, I can use Biopython, with your example, to get the information for each PubMed ID. What I really want is only the city/state/country from each first author affiliation, and I can parse that out later.
You will have to write your own code to extract information at that level of detail. Try to understand the structure of the output and then write a parser that retrieves the city, country, etc. The affiliation cannot be retrieved from PubMed as separate city, state, and country fields. It comes back as one long affiliation string, and you will have to parse whatever you need out of that string. PS: I have updated a link in my previous comment. Please go through that tutorial; it should be very helpful. Thanks.
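As a rough illustration of that kind of parsing, here is a minimal shell sketch. It assumes the affiliation is a comma-separated string that ends with the location and possibly an e-mail address, which is common but by no means guaranteed in PubMed records. The function name affiliation_tail and the sample affiliation are made up for this example.

```shell
# Hypothetical helper: print the last N comma-separated parts of an
# affiliation string (default 3), separated by tabs.
# Heuristic only: affiliation strings are free text.
affiliation_tail() {
  printf '%s\n' "$1" |
    # Drop a trailing e-mail address, then a trailing period.
    sed -E 's/[[:space:]]+[^[:space:]]+@[^[:space:]]+$//; s/\.[[:space:]]*$//' |
    awk -F',' -v n="${2:-3}" '{
      start = NF - n + 1
      if (start < 1) start = 1
      out = ""
      for (i = start; i <= NF; i++) {
        field = $i
        gsub(/^[ \t]+|[ \t]+$/, "", field)   # trim surrounding blanks
        out = out (i > start ? "\t" : "") field
      }
      print out
    }'
}

# Made-up sample affiliation; prints the last three parts.
affiliation_tail 'Department of Biology, Example University, Boston, MA 02115, USA. someone@example.edu'
```

Affiliations that do not follow the "institution, city, state, country" pattern will need extra rules, so treat the output as a starting point and spot-check it.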
Please check the new code.
Thanks, Ashutosh. I'm familiar with Python and know that Biopython can handle these kinds of tasks. I was hoping to get it all done with Entrez Direct.
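For reference, one way this might be done without leaving EDirect is to limit the author block to its first match with xtract's -position argument. The sketch below is untested against live data. It assumes xtract is on the PATH and that the affiliation is nested under each Author element in the efetch XML. The function name first_author_rows is made up for this example.

```shell
# Hypothetical wrapper: reads PubMed XML (efetch -format xml) on stdin and
# prints one row per article: PMID, first author, affiliation(s), date.
# Assumes xtract (EDirect) is installed and on the PATH.
first_author_rows() {
  xtract -pattern PubmedArticle \
    -element MedlineCitation/PMID \
    -block Author -position first \
      -sep " " -element LastName,Initials \
      -element Affiliation \
    -block PubDate \
      -sep " " -element Year,Month MedlineDate
}

# Intended usage, mirroring the commands in the question (needs network access):
#   esearch -db pubmed -query "search string" |
#     efilter -mindate 2005 |
#     efetch -format xml |
#     first_author_rows
```

If the first author lists several affiliations they will all appear in the same row, so the city/state/country would still have to be parsed out of that text afterwards.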
Well, I also have code that doesn't use Biopython and uses pure E-utilities with my own XML parser, but I think Pierre's code would be much better than mine. Good luck.