Hi all, I have say 100 data set names from GEO, and for each, I want to pull out a reference using Efetch.
so for example, for the list:
GDS5204
GDS520
GDS4925
The output should be:
GDS4925 D'Souza M, Zhu X, Frisina RD. Novel approach to select genes from RMA normalized microarray data using functional hearing tests in aging mice. J Neurosci Methods 2008 Jun 30;171(2):279-87. PMID: 18455804
GDS520 Blalock EM, Chen KC, Sharrow K, Herman JP et al. Gene microarrays in hippocampal aging: statistical profiling identifies novel processes correlated with cognitive impairment. J Neurosci 2003 May 1;23(9):3807-19. PMID: 12736351
GDS5204 Lu T, Aron L, Zullo J, Pan Y et al. REST and stress resistance in ageing and Alzheimer's disease. Nature 2014 Mar 27;507(7493):448-54. PMID: 24670762
I am trying to use an Efetch command:
esearch -db GDS -query "GDS5204[ACCN]" | efetch -format docsum | xtract -pattern DocumentSummary -element PubMedIds
The command "runs" (as in no error), but there's also no output at all. I've tried editing different parts of the above command but I still just get the same thing.
Would someone know how to edit the above command to get the output described above (i.e. full reference for a set of data sets)?
Thanks
I really appreciate the help.
Unfortunately, when I type:
or
I get the same as before; no error, but also no output. Would you have any ideas?
When I type:
I can see the full XML output though, so I know it's not a connection problem or similar. I did also look up the link you sent, and I appreciate it; this is all new to me, and that's the link I've been following to get this far, but now I'm a bit stuck.
Your first query works on my comp and gives PMID as output (24670762)!
Thanks, that's strange, on either a work server, or on my home mac laptop (no server); when I type:
I can see XML output.
When I type:
It says (as expected) "No command-line arguments supplied to xtract"
But then when I type:
I get nothing at all.
Do you think it's strange that when I change the above command to:
I similarly get no output. So I think this means that it's not picking up the pattern at all, so then none of the elements matter?
I can see from the Extraction Arguments section that pattern tells the command what part of the XML file to look at, and I can see that PubmedArticle is one of the sections in the full XML output, so I don't know why pattern can't pick it up?
You are missing /PMID from MedlineCitation/PMID. Also, there are parsing issues on Biostars which are changing some of the commands. I'm highlighting them differently now.
Try these, they are working on my computer:
Gives back the PMID
efetch -db pubmed -id 24670762 -format xml | xtract -pattern PubmedArticle -element MedlineCitation/PMID
PMID and author's name
efetch -db pubmed -id 24670762 -format xml | xtract -pattern PubmedArticle -element MedlineCitation/PMID -block Author -sep " " -tab "" -element "&COM" Initials,LastName -COM "(, )"
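If xtract stays silent, one way to cross-check what `-pattern PubmedArticle -element MedlineCitation/PMID` is meant to select is to walk the same path with a general XML parser. The following is only a minimal sketch, not part of EDirect: the function name `extract_pmids` and the tiny `SAMPLE` document are made up for illustration, and a real efetch result contains many more elements.

```python
import xml.etree.ElementTree as ET

def extract_pmids(xml_text):
    """Return the MedlineCitation/PMID text of every PubmedArticle element."""
    root = ET.fromstring(xml_text)
    pmids = []
    # Each PubmedArticle plays the role of one xtract "pattern" match.
    for article in root.iter("PubmedArticle"):
        pmid = article.findtext("MedlineCitation/PMID")
        if pmid is not None:
            pmids.append(pmid.strip())
    return pmids

# Hypothetical, heavily trimmed stand-in for an efetch XML result.
SAMPLE = """<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID Version="1">24670762</PMID>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>"""

print(extract_pmids(SAMPLE))
```

If the parser raises an error on a saved efetch file instead of returning a list, the problem is more likely in the XML itself than in the xtract arguments.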
I really appreciate it. I think there's something wrong with either my computer (just a standard mac terminal) or my connection to entrez. As you can see from the attached image here, even when I type the first command, I just don't get the PMID back. Thank you for your help though, I appreciate it.
That's very strange!! Let's do some checks: are you able to run these commands individually?
Does it produce 1) tmp.xml and 2) 24670762 as output? If not, your xtract command is not working. You may reinstall it from e-utils.
Thanks so much. I officially give up; I think I'd rather do >100 references manually at this stage. Command 1 works (as in gives me the tmp.xml file); command 2 doesn't work (no output).
So I re-installed e-utils:
Which says it was successfully installed. And then I typed the two commands again, and again, the first one works and the second one doesn't. I don't want to waste any more of your time, I really appreciate the help you gave me.
Thank you.
I am sorry; it's not your fault. e-utils is notorious for not being documented properly and not giving proper error messages :-( But I had a deeper look and I probably understand what the problem is. Can you do the following (I assume you are on a Mac)?
1. Download xtract.Darwin from here: ftp://ftp.ncbi.nlm.nih.gov/entrez/entrezdirect/xtract.Darwin
2. $ chmod +x xtract.Darwin
3. Just run it on the command line:
$ xtract.Darwin
=> This should give an error something like "ERROR: No command-line arguments supplied to xtract"
4. If 1-3 pass as described, then try this command line (note xtract.Darwin in the command):
cat tmp.xml | xtract.Darwin -pattern PubmedArticle -element MedlineCitation/PMID
If 4 passes, then add this xtract.Darwin to your edirect folder, and it should now work normally, i.e.
cat tmp.xml | xtract -pattern PubmedArticle -element MedlineCitation/PMID
In case of error in any step, please paste the complete error here (with step number).
I appreciate it; Everything works great, until step 4, where I get no output at all, so there is no error I can paste (and I checked, tmp.xml is not empty).
I emailed the NCBI help desk this question:
and they replied:
My problem is; I do get the output that they describe "GDS4925 18455804 24587312", but I still need to put it into this format:
GDS4925 D'Souza M, Zhu X, Frisina RD. Novel approach to select genes from RMA normalized microarray data using functional hearing tests in aging mice. J Neurosci Methods 2008 Jun 30;171(2):279-87. PMID: 18455804
This is one of the weirdest things I've seen in a while! Did you use the xtract.Darwin that was downloaded in step 1 for step 4?
cat tmp.xml | xtract.Darwin -pattern PubmedArticle -element MedlineCitation/PMID
If yes, could you upload your tmp.xml and xtract.Darwin somewhere for check?
Yes no problem, so this is my xtract.Darwin:
http://www.filehosting.org/file/details/664811/xtract.Darwin
and this is my tmp.xml (for GDS4925, the dataset in above example):
http://www.filehosting.org/file/details/664812/tmp.xml
Thanks for your help. I've also emailed back the help desk to ask them this question again/explain that I still can't get the xtract command to work.
Ok, this is what I was guessing and was afraid of: your tmp.xml file is truncated (which means there is some issue with the download of this file from the NCBI server). See my tmp.xml for reference https://ufile.io/nfwj8
First check whether this happens consistently with other PMID queries and not just for this PMID. If that's the case, you need to look more closely at the connectivity issue. Are you behind some firewall?
PS You may also use some options to debug if this is a connectivity issue:
https://www.ncbi.nlm.nih.gov/books/NBK179288/
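A quick way to test whether a saved result such as tmp.xml is truncated is to try parsing it: a file cut off mid-download is no longer well-formed XML, so a strict parser rejects it. This is only a sketch under that assumption; the helper name `is_well_formed` and the two inline strings are illustrative, and for a real check you would pass in the contents of your own file.

```python
import xml.etree.ElementTree as ET

def is_well_formed(xml_text):
    """Return True if xml_text parses as complete XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        # Raised for unclosed elements, empty input, and other syntax problems.
        return False
    return True

# Illustrative inputs: a complete document and one cut off part-way through.
complete = "<PubmedArticleSet><PubmedArticle/></PubmedArticleSet>"
truncated = "<PubmedArticleSet><PubmedArticle><MedlineCitation><PMID>18455804</PMID>"

print(is_well_formed(complete), is_well_formed(truncated))
```

To check a file on disk, read it first, for example `is_well_formed(open("tmp.xml", encoding="utf-8").read())`. Running this over the results for several PMIDs would show whether the truncation is consistent or specific to one record.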
Thank you; I emailed back the help desk to ask them why the file was being truncated (this happened whether I was at home or in work); they replied:
Unfortunately, web displayed citation format is NOT a format supported by API.
There is no good way to regenerate that format through fetched content. JSON format returned by esummary given a pubmed id, contains all the pieces. One need to piece them back together in a human readable form. The following is a thread on this topic:
http://www.alexhadik.com/blog/2014/6/12/create-pubmed-citations-automatically-using-pubmed-api
So unfortunately, I think I may need to just get every reference manually. Thank you so so much for your time.
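For what it's worth, "piecing them back together" from the summary fields can be fairly short. The sketch below is not an official format: it assumes the record is a dictionary with `authors` (each with a `name`), `title`, `source`, `pubdate`, `volume`, `issue`, `pages` and `uid` keys, which is how I understand the esummary JSON for PubMed to be laid out, so please verify against a real response. The `max_authors=4` cutoff before "et al." is a guess based on the example citations in the question, and the record here is typed in by hand from the first of those citations rather than fetched.

```python
def format_citation(accession, record, max_authors=4):
    """Build 'ACCESSION Authors. Title. Journal Date;Vol(Issue):Pages. PMID: N'."""
    names = [author["name"] for author in record.get("authors", [])]
    if len(names) > max_authors:
        # Assumed rule: list the first few authors, then "et al."
        authors = ", ".join(names[:max_authors]) + " et al."
    else:
        authors = ", ".join(names) + "."
    # Avoid a doubled full stop if the title already ends with one.
    title = record["title"].rstrip(".") + "."
    locator = f"{record['pubdate']};{record['volume']}({record['issue']}):{record['pages']}."
    return (f"{accession} {authors} {title} {record['source']} "
            f"{locator} PMID: {record['uid']}")

# Hand-built record using the values from the GDS4925 citation in the question.
record = {
    "uid": "18455804",
    "authors": [{"name": "D'Souza M"}, {"name": "Zhu X"}, {"name": "Frisina RD"}],
    "title": "Novel approach to select genes from RMA normalized microarray data "
             "using functional hearing tests in aging mice.",
    "source": "J Neurosci Methods",
    "pubdate": "2008 Jun 30",
    "volume": "171",
    "issue": "2",
    "pages": "279-87",
}

print(format_citation("GDS4925", record))
```

With the record above this prints the GDS4925 line exactly as written in the question. Records with a missing volume, issue or page range would need extra handling that this sketch does not attempt.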
This whole episode has become a detective movie for me! The main question I am trying to answer is: if I am able to do it at my end, why aren't you? If the format is not supported by the API for you, it must be the same for me. Also, XML is a standard format for data exchange; I don't know why you would have to go for JSON. Moreover, it is fetching the file partially. If it were a problem with the format or the interface itself, I would have expected no output altogether! If only I could get answers to those questions!
I know, it seems like this would be relatively straightforward, it's very frustrating. Unfortunately, I am not an expert in this particular area, so I don't feel like I know enough (yet) to try and find a workaround or understand exactly what's happening.
In the link, go directly to the section "Extraction Arguments"