Suppose I'm looking at a GEO entry like so: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM946533
Notice at the bottom is a table of 'Supplementary files', containing broadpeak and bigwig files. I'm wondering how to download these using Entrez.
I've tried a variety of approaches; the closest I've gotten is the following:
esearch -db gds -query GSM946533 | efetch -format docsum > docsum.xml
This gives me several tags for "suppFile", and they contain the correct file types:
<suppFile>BIGWIG, BROADPEAK</suppFile>
(This is the last suppFile in the docsum.xml result.) However, that's about as close as I can get to touching these files. Obviously I can just go to the webpage and download them from there, but I'm wondering whether there is a command-line method or not.
Edit: by "closest to touching these files", I mean that so far I haven't been able to get Entrez to fetch even the filenames, just their types. I do get something like the output below, but I don't think these are the files I'm looking for. The files I want are named like GSM946533_mm9_wgEncodePsuHistoneG1eH3k04me3ME0S129InputPk.broadPeak.gz (notice the difference in how the histone marks are written), and even if these were the files, it's not clear to me how to download them:
<suppFile>BIGWIG, BROADPEAK, TXT</suppFile>
<Samples>
<Sample>
<Accession>GSM946525</Accession>
<Title>PSU_ChipSeq_Megakaryo_H3K4me1</Title>
</Sample>
<Sample>
<Accession>GSM946545</Accession>
<Title>PSU_ChipSeq_G1E-ER4_H3K9me3</Title>
</Sample>
<Sample>
<Accession>GSM946548</Accession>
<Title>PSU_ChipSeq_CH12_H3K9me3</Title>
Whoa! Thank you! Exactly what I was looking for!!
I added onto your pipeline:
sed 's/^ftp/https/; s/$/suppl\//'
to turn the FTP links into HTTPS links pointing at the suppl/ directory :)
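As a rough sketch of what that sed expression does, assuming the pipeline prints one FTP directory link per line with a trailing slash. The link below is a hypothetical example in the layout GEO seems to use for sample records, not output copied from Entrez:

```shell
# Hypothetical input: an FTP directory link for a GEO sample record.
# The exact path is an assumption made for illustration.
ftp_link='ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM946nnn/GSM946533/'

# s/^ftp/https/   replaces only the leading "ftp" scheme with "https"
# s/$/suppl\//    appends "suppl/" to the end of the line
suppl_url=$(printf '%s\n' "$ftp_link" | sed 's/^ftp/https/; s/$/suppl\//')

printf '%s\n' "$suppl_url"
```

The resulting URL should point at the directory holding the supplementary files, which can then be handed to a downloader such as wget or curl. Note that the second substitution relies on the input ending in a slash; without one, "suppl/" would be glued directly onto the accession.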