Forum:How to enable scientific software reproducibility?
1
2
10.2 years ago
martenson ▴ 380

To make a long story short, I would like to ask for your input on using Homebrew as a means of achieving better reproducibility in scientific software pipelines.

Please see more context in the official repo of homebrew-science here: https://github.com/Homebrew/homebrew-science/issues/1191

versioning software galaxy reproducibility • 2.0k views
0

moved to "forum"

0

Brad Chapman is right when he brings up Docker. Docker is a much more complete solution for dependencies than Homebrew.

0

As I understand it, Docker serves as a container to run things. To build the container, you still need to use something like Homebrew anyway.

1

That's true; dependency fetchers can certainly be used by developers to build a Docker container. Honestly, where the brew concept might be most attractive is for data. Right now I have scripts that are full of URLs like:

ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/pilot3_exon_targetted_GRCh37_bams/data/NA12891/alignment/NA12891.chrom22.ILLUMINA.bwa.CEU.exon_targetted.20100311.bam

Honestly, this would be better maintained as a brew recipe by someone who works with the data:

brew install 1kg_ceu_trio_bams
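
A minimal sketch of the same idea, written in Python rather than Homebrew's own formula format. Everything here is an assumption for illustration: the `DATA_RECIPES` registry and the helpers `resolve` and `sha256_of` are hypothetical, and `1kg_ceu_trio_bams` is only the wished-for recipe name from above, not an existing formula. The registry lists just the one URL quoted in this thread.

```python
import hashlib
from pathlib import Path

# Hypothetical registry: recipe name -> list of source URLs.
# Only the single URL quoted in the thread is listed here.
DATA_RECIPES = {
    "1kg_ceu_trio_bams": [
        "ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/technical/"
        "pilot3_exon_targetted_GRCh37_bams/data/NA12891/alignment/"
        "NA12891.chrom22.ILLUMINA.bwa.CEU.exon_targetted.20100311.bam",
    ],
}


def resolve(recipe):
    """Return the URLs behind a recipe name."""
    try:
        return list(DATA_RECIPES[recipe])
    except KeyError:
        raise KeyError(f"no data recipe named {recipe!r}") from None


def sha256_of(path, chunk_size=1 << 20):
    """Hash a downloaded file in chunks so large BAMs need not fit in memory."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Scripts would then call `resolve("1kg_ceu_trio_bams")` instead of hard-coding the URL, and a maintainer could store an expected checksum next to each entry to compare against `sha256_of` after download.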

3
10.2 years ago

Reproducibility is also dear to my heart, though lately I have been asking myself what "reproducibility" even means.

Does it mean that we should be able to run a pipeline with the exact same versions of the programs and the exact same parameters and they will produce the exact same answer no matter what? Nowadays I've come to think that this type of reproducibility is not all that useful.

The reproducibility that I hope for from science is that of scientific observations and results.

I recall a paper (published in Nature) where, as it later turned out, choosing the size of the upstream region to be exactly 1000 bp was the critical parameter for all subsequent results. The study would not produce the same results with 900 bp or 1100 bp "upstream" regions. Basically, the genes seemed to be regulated by upstream binding only when 1000 bp was chosen as the definition of upstream ... that's some insight alright...

So perhaps the exact opposite is required: if a study cannot be reproduced by a similar but different approach, it is likely to be a case of overfitting.
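
One way to make that check concrete is a small parameter sweep. The sketch below is in Python and entirely hypothetical: `is_robust`, the two toy analysis functions, and the tolerance of 0.05 are stand-ins, not the analysis from the paper mentioned above.

```python
def is_robust(analysis, values, tolerance):
    """Return True if the result varies by at most `tolerance` across `values`."""
    results = [analysis(v) for v in values]
    return max(results) - min(results) <= tolerance


# Toy stand-in for a pipeline that only "works" at exactly 1000 bp.
def overfit_analysis(upstream_bp):
    return 0.9 if upstream_bp == 1000 else 0.1


# Toy stand-in for a pipeline whose result barely depends on the cutoff.
def stable_analysis(upstream_bp):
    return 0.8 + upstream_bp / 100000


sizes = [900, 1000, 1100]
print(is_robust(overfit_analysis, sizes, tolerance=0.05))  # False
print(is_robust(stable_analysis, sizes, tolerance=0.05))   # True
```

A result that flips when the cutoff moves by 100 bp fails the sweep, which is the overfitting signal described above.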

I don't want to discourage the Homebrew integration though. I think it would be very valuable and essential. I wish we could easily install bioinformatics software with a single command when we want it. Being able to run a recently published tool and other alternative approaches with ease would make the process of reproducing results so much simpler.

But what I absolutely don't think is necessary is to install an old version X of software Y just because someone used that version in their analysis some time ago. That just basically says: let's ignore everything that we learned since then and rewind to a time when the software was worse than it is today, we knew less and expected less.

If a study only works with an old version of just one piece of software, it is very likely not worth reproducing.

2

I think what you're describing is better termed "robustness". I'd agree that it's more important though (it's the same reason people have to look at how sensitive numerical simulations are to changes in input parameters prior to publication).
