Question

Using Entrez to download Supplementary files in GEO entry via command line?

21 months ago

ccc ▴ 30

Suppose I'm looking at a GEO entry like so: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM946533

Notice at the bottom is a table of 'Supplementary files', containing broadpeak and bigwig files. I'm wondering how to download these using Entrez.

I've tried a variety of approaches, the "closest" I've gotten is the following:

esearch -db gds -query GSM946533 | efetch -format docsum > docsum.xml

This gives me several tags for "suppFile", and they contain the correct file types:

<suppFile>BIGWIG, BROADPEAK</suppFile>

(This result is the last suppFile in the docsum.xml output.) However, that's about as close as I can get to touching these files. Obviously I can just go to the webpage and download them through the browser, but I'm wondering whether there is a command-line method or not.

Edit: by "closest to touching these files", I mean that so far I haven't been able to get Entrez to fetch even the filenames, just their types. I do get something like the output below, but I don't think these are the files I'm looking for. Those are named like GSM946533_mm9_wgEncodePsuHistoneG1eH3k04me3ME0S129InputPk.broadPeak.gz (notice the difference in how the histone marks are written), and even if these were the files, it's not clear to me how to download them:

<suppFile>BIGWIG, BROADPEAK, TXT</suppFile>
<Samples>
    <Sample>
        <Accession>GSM946525</Accession>
        <Title>PSU_ChipSeq_Megakaryo_H3K4me1</Title>
    </Sample>
    <Sample>
        <Accession>GSM946545</Accession>
        <Title>PSU_ChipSeq_G1E-ER4_H3K9me3</Title>
    </Sample>
    <Sample>
        <Accession>GSM946548</Accession>
        <Title>PSU_ChipSeq_CH12_H3K9me3</Title>

bash command-line entrez • 810 views

updated 21 months ago by Ram 44k • written 21 months ago by ccc ▴ 30

score 2 · Accepted Answer · 2023-02-21

21 months ago

GenoMax 147k

With the following you can get the FTP directory:

$ esearch -db gds -query GSM946533 | efetch -format docsum | xtract -pattern DocumentSummary -element FTPLink | grep samples
ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM946nnn/GSM946533/

The files you want are under the suppl directory at that link: https://ftp.ncbi.nlm.nih.gov/geo/samples/GSM946nnn/GSM946533/suppl/

So while you can't get the filenames directly from Entrez, this should get you close. Other samples seem to follow the same pattern of links.
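If that pattern holds, the suppl URL can also be built from the accession alone. This is only a sketch: `geo_suppl_url` is a made-up helper name, and it assumes the parent directory is always the accession with its last three digits replaced by `nnn`, as in the link above.

```shell
# Hypothetical helper: build the suppl/ URL for a GSM accession.
# Assumption: the parent directory is the accession with its last
# (up to) three digits replaced by "nnn", e.g. GSM946533 -> GSM946nnn.
geo_suppl_url() {
    acc="$1"
    stub=$(printf '%s\n' "$acc" | sed -E 's/[0-9]{1,3}$/nnn/')
    printf 'https://ftp.ncbi.nlm.nih.gov/geo/samples/%s/%s/suppl/\n' "$stub" "$acc"
}

geo_suppl_url GSM946533
# prints: https://ftp.ncbi.nlm.nih.gov/geo/samples/GSM946nnn/GSM946533/suppl/
```

Querying FTPLink through Entrez, as above, is still the safer route, since it does not depend on guessing the directory layout.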


Whoa! Thank you! Exactly what I was looking for!!

I added this onto the end of your pipeline to turn the FTP link into the HTTPS suppl URL: sed 's/^ftp/https/; s/$/suppl\//' :)
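For anyone who wants the whole thing in one place, here is a rough sketch. The function name `to_suppl_url` is just something made up for this example, and the commented-out download step with wget is an assumption that has not been tested against every GEO entry.

```shell
# Rewrite a GEO FTPLink (read from stdin) into the HTTPS suppl/ URL,
# using the same sed expression as in this reply.
to_suppl_url() {
    sed 's/^ftp/https/; s/$/suppl\//'
}

# Example with the link returned in the answer:
echo 'ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM946nnn/GSM946533/' | to_suppl_url
# prints: https://ftp.ncbi.nlm.nih.gov/geo/samples/GSM946nnn/GSM946533/suppl/

# Hypothetical end-to-end use (needs network access, EDirect, and wget):
# esearch -db gds -query GSM946533 | efetch -format docsum \
#   | xtract -pattern DocumentSummary -element FTPLink | grep samples \
#   | to_suppl_url | xargs -n 1 wget -r -np -nd -A 'GSM*'
```

The wget flags here are `-r` (recurse into the directory listing), `-np` (do not climb to the parent directory), `-nd` (do not recreate the directory tree locally), and `-A 'GSM*'` (keep only files whose names start with GSM).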

21 months ago by ccc ▴ 30