I want to store a Docker image reproducing a paper. Is there a host for Docker images of studies?
8.4 years ago • endrebak ▴ 980

I want to store a Docker image with the input data, code, software, intermediate results, and final results. This will be a massive file.

Is there something like the Sequence Read Archive for people who want to upload a Docker image of their whole study? If not, what is the best/cheapest way to store such a file?

Any related thoughts not directly answering the question are also welcome.

docker • reproducible • 3.0k views

Maybe quay.io? I'm not sure what sort of size limits they have.

What if you send it to the journal as a supplementary file? Some bioinformatics journals have no size limits for supplementary file hosting.

8.4 years ago • Ryan Dale 5.0k

figshare might be an option. You may need to split the data into chunks and provide a download script, as in this example (6 GB of compressed data split into chunks), though their individual accession size limits may have increased since then.
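
If it helps, here is a minimal sketch of the kind of download-and-reassemble script that approach implies, assuming hypothetical figshare file IDs (nothing here comes from the actual accession):

    # Minimal sketch: fetch the chunks of a split archive and stitch them
    # back together in order. The file IDs below are hypothetical.
    import urllib.request

    CHUNK_URLS = [
        f"https://ndownloader.figshare.com/files/CHUNK_ID_{i:03d}"  # hypothetical IDs
        for i in range(3)
    ]

    with open("study-data.tar.gz", "wb") as out:
        for url in CHUNK_URLS:
            with urllib.request.urlopen(url) as resp:
                # Stream each chunk into the combined archive, 1 MiB at a time.
                while True:
                    block = resp.read(1 << 20)
                    if not block:
                        break
                    out.write(block)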

Good suggestion for the data portion. I didn't even think of FigShare; I couldn't remember what their individual data limits were.

8.4 years ago • John 13k

I get that the point of using a Docker image is reproducibility, but the key selling point of Docker is its modularity! It seems slightly blasphemous to put everything into one Docker image. Why not just use a VM image if you're going to make a multi-gigabyte offline reproducibility archive?

Instead, I would give some serious thought to what genomax said: build a Docker image that automates the process of downloading, decompressing, etc., all the raw public data and turning that into the final result. This way your Docker image would be tiny, and you don't have issues with the Docker data and the public data falling out of sync if corrections are ever needed (as a general rule of thumb, there should be only one place to download the data from). And of course, you can update a 100 MB Docker image to fix a typo in a script much more easily than a 100 GB one.
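
A minimal sketch of the entrypoint script such a slim image could run, assuming a hypothetical data URL and a hypothetical run_pipeline.py driver (neither comes from the thread):

    # Minimal sketch: the image ships only code; at run time it fetches the
    # raw public data and re-runs the analysis. URL and driver are hypothetical.
    import subprocess
    import urllib.request

    RAW_DATA_URL = "https://example.org/raw/study-input.tar.gz"  # hypothetical

    urllib.request.urlretrieve(RAW_DATA_URL, "input.tar.gz")
    subprocess.run(["tar", "xzf", "input.tar.gz"], check=True)

    # Re-create the published results from the freshly downloaded inputs.
    subprocess.run(["python", "run_pipeline.py", "--input", "input"], check=True)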

As an aside, it seems to be very popular these days, when met with the question of "how can I reproduce this in 10 years' time?", to think of the future as some cataclysmic hellscape where nothing works anymore. Some poor future Bioinformatician slumped over a green and black cathode-ray monitor mumbling about "the wisdom of the ancients" while his buddies pedal bicycles to generate power. All to reproduce the RNA-Seq findings of some 10-year-old study.

As someone typing this while playing Pokemon Red via Gameboy emulator on his phone, I'd say the chances of bad code still working 10 years from now are fairly high, so long as the code was fairly popular at the time :)

As someone typing this while playing Pokemon Red via Gameboy emulator on his phone

Anything to avoid writing up your thesis, @John? Or has that been done? :-)

Hehe, if it's not Professor Oak telling me to focus, it's Dr. Genomax heheh :D (but yes I should really get back to work!)

As someone typing this while playing Pokemon Red via Gameboy emulator on his phone

You should really get a DS... Platinum and HeartGold are so much better! You'll only look ~30% more nerdy.

Hahaha, I don't know. The first 151 I can get behind, but if I learn any more abstract names with deeper meanings implied, I'll probably end up conflating Pokemon with gene names.

"Here we can see a 20% increase in the amount of SHINX in our Bidoof cell-line."

To be honest, no one would probably notice anyway. I recently sat through a lecture where the premise was that all individuals have the power to phenotypically transform if the environmental stressors during development are significant enough; otherwise our true phenotype is blunted by epigenetic buffers. Essentially, epigenetics holds our "true phenotypic potential" back until we need it. Cue the Japanese rock music and collectable keychain merchandise.

Bioinformatician slumped over a green and black cathode-ray monitor mumbling about "the wisdom of the ancients"

Sounds like me before coffee every morning.

Edit: Oh, and finish your thesis!

Yes, this is slightly blasphemous, but it is also the most foolproof/easiest way. People could just find the files they want without running a script, and nothing could go wrong with the downloading.

Yeah, no, I totally get why a self-contained package would seem like a very logical and convenient solution, but honestly this is one of those classic examples of wanting to do something pure and sensible while the mess of real life presents you with no really good options to achieve it. The last paragraph of that story is pretty much your/our exact situation.

What is the most likely scenario in which someone looks to reproduce your results manually? Most likely, they are already using the data but can't reproduce the results and don't know why. If you bundle code with data, they will have to re-download all the data. Moreover, to get that Docker container they have to download it, implying they have a working internet connection and could have downloaded the data separately from the Docker container, as per genomax's suggestion.

Finally, if I can't convince you with fluffy philosophy, know that if you download a Docker container from some website, you'll probably do so over 1 TCP socket and 1 CPU thread. If you get Docker to do the downloading from public repos, there is no such limitation, and the user can probably max out their download bandwidth :)
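
To make the bandwidth point concrete, here is a minimal sketch of a downloader that opens several connections in parallel rather than pulling one monolithic file over a single socket (the file URLs are hypothetical placeholders):

    # Minimal sketch: parallel downloads over separate connections.
    from concurrent.futures import ThreadPoolExecutor
    import urllib.request

    FILES = {
        f"part_{i}.fastq.gz": f"https://example.org/data/part_{i}.fastq.gz"  # hypothetical
        for i in range(4)
    }

    def fetch(item):
        name, url = item
        urllib.request.urlretrieve(url, name)
        return name

    with ThreadPoolExecutor(max_workers=4) as pool:
        for name in pool.map(fetch, FILES.items()):
            print("downloaded", name)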

8.4 years ago • DG 7.3k

Storage of very large Docker images is potentially a bit of an issue. As Devon suggested, you can look into Quay.io, but I don't know whether they have any file size restrictions, and in practice downloading a single large file can be a problem. You might consider separating the data storage from the Docker image itself, which replicates the code and software aspects. Then you could put the Docker image anywhere (Quay.io or just Docker Hub) and store all of the data in a public Amazon S3 bucket (or another cloud storage system with a nice API). The Docker image could include code for pulling the data from AWS.
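
For instance, a minimal sketch of that pull step with boto3, assuming the data sit in a public S3 bucket (bucket and key names are hypothetical):

    # Minimal sketch: anonymous download from a public S3 bucket.
    import boto3
    from botocore import UNSIGNED
    from botocore.config import Config

    # Unsigned requests let anyone download without AWS credentials.
    s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))
    s3.download_file("my-study-data", "raw/input.tar.gz", "input.tar.gz")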

While this is a great suggestion, paying for any of this is likely not in the plans for @endrebak (though I have been wrong before).

Pointing to an ENA accession for the raw data (since one can get fastq files directly) could be a free option instead of Amazon. The Docker image can include a pre-prepared script to automate downloading the data.
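
A minimal sketch of such a script, using the ENA Portal API's filereport endpoint to list and fetch the fastq files for an accession (the accession below is a hypothetical placeholder):

    # Minimal sketch: resolve an ENA accession to its fastq FTP paths and
    # download them. The accession is hypothetical.
    import csv
    import io
    import urllib.request

    ACCESSION = "PRJEB00000"  # hypothetical
    REPORT = (
        "https://www.ebi.ac.uk/ena/portal/api/filereport"
        f"?accession={ACCESSION}&result=read_run"
        "&fields=run_accession,fastq_ftp&format=tsv"
    )

    with urllib.request.urlopen(REPORT) as resp:
        for row in csv.DictReader(io.TextIOWrapper(resp), delimiter="\t"):
            # fastq_ftp is a semicolon-separated list of host/path entries
            # without a scheme, so prepend ftp:// before downloading.
            for ftp_path in row["fastq_ftp"].split(";"):
                if ftp_path:
                    urllib.request.urlretrieve("ftp://" + ftp_path,
                                               ftp_path.rsplit("/", 1)[-1])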

I have a personal yearly allowance I can use for Amazon data storage, but if the data are downloaded a lot, might I have to pay much? We will see.

You can upload the data to AWS and then make the bucket containing the data "requester pays". If you do this, you pay only a flat fee for data storage, and you will not pay data transfer fees when others access the data.
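
For what downloaders would then run, here is a minimal boto3 sketch against a requester-pays bucket (bucket and key are hypothetical; the request is billed to the requester's own AWS account):

    # Minimal sketch: downloading from a requester-pays S3 bucket. The
    # RequestPayer flag is mandatory, and the requester pays transfer costs.
    import boto3

    s3 = boto3.client("s3")  # uses the requester's own credentials
    s3.download_file(
        "my-study-data",        # hypothetical bucket
        "raw/input.tar.gz",     # hypothetical key
        "input.tar.gz",
        ExtraArgs={"RequestPayer": "requester"},
    )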

Interesting option. Fine for those who can pay, but you would exclude many others who cannot.

Making it available on Amazon and then withdrawing it if you start incurring big charges may frustrate people. If you can make the ENA option work, start there, since it is likely to remain available long term.

I agree that, if possible, ENA or SRA is a good option for the data. You may also need to store data in different places if there is other bulk data you want to provide that ENA/SRA is unlikely to host. Bulk data storage and distribution is often difficult.

8.4 years ago

Here are a few options that come to mind:

  • Host the data on a public repository, for example Synapse, which was created for this exact purpose. You can keep the data private during publication and release it afterwards. Hosting is free of charge (a minimal synapseclient sketch follows this list).

  • As previously noted, the Docker image itself should be small. Inside the image, download the data from Synapse (or wherever you host it) and pull the code from GitHub or Bitbucket.

  • If you really need to host large datasets, Bitbucket has no hard space limits (although very large repositories may run into practical limits).

  • I remember some bioinformatics journals advertising that they have no limits on the size of supplementary materials, to encourage people to upload all the relevant data and facilitate reproducibility. However, right now I don't remember which journals use this policy. Sorry :-)
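
As mentioned in the first bullet, here is a minimal synapseclient sketch for fetching data hosted on Synapse (the synID is a hypothetical placeholder; a free Synapse account is required):

    # Minimal sketch: download a file stored in a Synapse project.
    import synapseclient

    syn = synapseclient.Synapse()
    syn.login()  # uses cached credentials or ~/.synapseConfig

    entity = syn.get("syn00000000")  # hypothetical synID
    print("downloaded to", entity.path)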

Thanks for the link to Synapse. I had seen the DREAM challenges on there but didn't realize it was also a general-purpose resource.
