Hello,
I am developing customized pipelines for ChIP-seq analysis using Snakemake. I want to share them, so I created model workflows that people can execute immediately after downloading the code. They handle file conversion, mapping, peak-calling, etc., and use public data from the GEO database. However, users currently have to download these data themselves. I would like to include an automatic download of the data (sra or fastq files), ideally by using GSM/GSE or SRR identifiers.
So far I've found several ways:
* SRA toolkit's fastq-dump function.
fastq-dump --outdir <outdir> <srr_ids>
However, this approach is insanely slow (as stated here).
* SRAdb R package
getSRAfile( in_acc = "<srr_ids>", sra_con = sra_con, destDir = <dir>, fileType = 'sra' )
This requires using this command first:
geometadbfile <- getSRAdbFile(destdir = <dir>, destfile = "SRAmetadb.sqlite.gz")
which downloads a 16 GB SQLite file locally. That could be fine if I were the only one using it, but I don't want users of my pipeline to be forced to do so...
* Biopython's Bio.Geo module
Not sure how this one works... http://biopython.org/DIST/docs/tutorial/Tutorial.html#htoc123
Entrez.esearch doesn't help me find the FTP URL or anything similar.
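Since Snakemake is Python-based, the fastq-dump option above could be wrapped in a small helper that takes a list of SRR identifiers. This is only a rough sketch: the function names, the `dry_run` flag and the example accessions are made up for illustration, and only the `fastq-dump --outdir <outdir> <srr_ids>` form comes from the command shown above. It does nothing about the speed problem.

```python
import shlex
import subprocess
from pathlib import Path


def fastq_dump_command(srr_ids, outdir):
    """Build the argument list for: fastq-dump --outdir <outdir> <srr_ids>."""
    ids = list(srr_ids)
    if not ids:
        raise ValueError("at least one SRR identifier is required")
    return ["fastq-dump", "--outdir", str(outdir), *ids]


def download_fastq(srr_ids, outdir, dry_run=False):
    """Run fastq-dump for the given runs and return the command as a string.

    With dry_run=True (a hypothetical convenience flag) the command is only
    built and returned, not executed.
    """
    cmd = fastq_dump_command(srr_ids, outdir)
    if not dry_run:
        # Create the output directory, then call the SRA toolkit binary,
        # which is assumed to be on the PATH.
        Path(outdir).mkdir(parents=True, exist_ok=True)
        subprocess.run(cmd, check=True)
    return shlex.join(cmd)


if __name__ == "__main__":
    # Placeholder accessions, used only to show the call.
    print(download_fastq(["SRR000001", "SRR000002"], "data", dry_run=True))
```

Building the command as a list and passing it to `subprocess.run` without a shell avoids quoting problems if an output directory contains spaces.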
Surely there is a simpler way to download the data?
Any idea will be greatly appreciated!
This is not an answer to the main question, since my experience is limited and I wouldn't be of much help there. But I am interested in testing these pipelines of yours if they are publicly available!
Hi, thanks for your interest. My code is available on GitHub. Please note that it's under development, and there's still a lot to do! I'm also developing a virtual machine, in order to simplify the distribution.
I sympathise with your frustration. Getting data and metadata from GEO programmatically doesn't seem to be straightforward.
Just to be clear, GEO and SRA are two totally separate databases and GEO does not host sequencing data at all.
True; the main use of GEO would be finding common experimental data sets (you would then have to use elink to grab the SRA information).
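For what it's worth, the elink step could look roughly like the sketch below, which only builds the E-utilities request URL linking a GEO DataSets record (Entrez database `gds`) to `sra`. The function name and the example UID are hypothetical. Note that elink expects the numeric Entrez UID of the GEO record, not the GSE/GSM accession itself, so an esearch lookup would be needed first.

```python
from urllib.parse import urlencode

# Base URL of the NCBI E-utilities elink endpoint.
EUTILS_ELINK = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi"


def geo_to_sra_elink_url(gds_uid):
    """Return the elink URL that asks for SRA records linked to a GEO UID."""
    params = {"dbfrom": "gds", "db": "sra", "id": str(gds_uid)}
    return f"{EUTILS_ELINK}?{urlencode(params)}"


if __name__ == "__main__":
    # 12345 is a placeholder UID, not a real record.
    print(geo_to_sra_elink_url(12345))
```

The response is XML listing the linked SRA UIDs, which would still have to be resolved to SRR run accessions before anything can be downloaded.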