Question

Pipeline version control system, pipeline storage

0

3.3 years ago

Anastasiya M • 0

Hello! Please take a few minutes; your answers to these questions are important to me.

Do you use a version control system for your pipelines? If so, which one?
Where do you store your pipelines?
Is there anyone who doesn't use anything for storage and versioning?

pipeline git storage • 1.3k views

ADD COMMENT • link updated 3.2 years ago by i.sudbery 20k • written 3.3 years ago by Anastasiya M • 0

2

git git git

In your version control system you can use any workflow manager. See here for a few examples: https://github.com/pditommaso/awesome-pipeline

See an example of Nextflow pipelines hosted on GitHub here: https://github.com/NBISweden/pipelines-nextflow

Then, for better reproducibility, your pipeline can use conda or, even better, containers (Docker/Singularity). The example above uses both: you choose with a parameter whether to run the pipeline in a conda environment or in Docker/Singularity containers.

ADD REPLY • link 3.3 years ago by Juke34 8.9k

0

Additionally, I would suggest using an isolated environment, such as Conda, which tracks the versions of the tools you used.

ADD REPLY • link 3.3 years ago by Medhat 9.8k

0

Q1) conda is a good option - you can set up a conda environment for each pipeline (or shared environments, as applicable), and then 'activate' the appropriate environment for a given pipeline

Q2) git and/or confluence

ADD REPLY • link 3.3 years ago by noodle ▴ 590

score 1 · Answer 1 · 2021-08-24

The answer to this depends on what you mean by storing/version controlling a pipeline.

If you mean the code of the pipelines - the script, CWL definition file or makefile - then absolutely this should be version controlled using git, as mentioned by the others. As for where we store code: we tend to separate out production pipelines that we expect to use over multiple projects - we will have clones of these either in the group's shared storage or in people's home directories. For project-specific pipelines, the code will live in a src sub-directory of the project directory.

But another meaning of the question could be "Do you use version control for pipeline runs? Where do you store pipeline runs?" By "runs", I mean the collection of input files, configuration files, intermediate files and output files that arise from running a pipeline on some input files.

Such collections of files are often (very) large and binary, and unsuitable for traditional version control, which works best for shortish text files. However, the beauty of pipelines should be that data + code + configuration + cpu time = results.
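One way to make "data + code + configuration = results" concrete is to record, in a small text file, exactly which inputs, configuration and code version went into a run. The sketch below is only an illustration, not something described in this thread: the function names, the manifest layout and the file names are all hypothetical. It stores SHA-256 checksums rather than the large files themselves, so the manifest stays short enough for ordinary version control.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large binary inputs need not fit in memory."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_manifest(inputs, config, code_version, manifest_path):
    """Write a small JSON manifest describing one pipeline run.

    All arguments are hypothetical conventions for this sketch:
    inputs        -- iterable of input file paths
    config        -- path to the pipeline configuration file
    code_version  -- any string identifying the code (e.g. a git commit hash)
    manifest_path -- where to write the JSON manifest
    """
    manifest = {
        "code_version": code_version,
        "config": {"path": str(config), "sha256": sha256_of(config)},
        "inputs": {p: sha256_of(p) for p in sorted(map(str, inputs))},
    }
    text = json.dumps(manifest, indent=2, sort_keys=True) + "\n"
    Path(manifest_path).write_text(text)
    return manifest
```

A call such as `write_manifest(["reads.fastq.gz"], "pipeline.yml", "<commit hash>", "manifest.json")` (file names assumed) would produce a text file that can be committed alongside the configuration, while the bulky inputs and outputs stay outside the repository.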

Thus, for us, the ideal is (and I'm not saying we always manage to live up to this):

For each pipeline run, a git repo is initiated in the pipeline run directory, or the pipeline run directory is added to a project repo.
In the repo we put: the pipeline configuration file, the pipeline log file, and an automated script that will generate/link/copy the input data files from raw data stores (our own /Raw_data dir, or GEO etc).
In reality, because of the limitations of our HPC system, we run pipelines on the Lustre filesystem attached to the HPC - but this can only be used temporarily, so the config, log etc. files are actually created on the long-term file store and then linked to the fast, short-term storage.
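The arrangement in the last point, where the small files live on the long-term store and are linked into the fast scratch space, could be sketched roughly as follows. This is a minimal illustration under assumptions, not our actual script: the function name, the directory arguments and the file names `pipeline.yml` and `pipeline.log` are all hypothetical.

```python
from pathlib import Path


def link_run_files(store_dir, scratch_dir, names=("pipeline.yml", "pipeline.log")):
    """Create small run files on long-term storage and symlink them into scratch.

    store_dir   -- run directory on the long-term file store (hypothetical path)
    scratch_dir -- run directory on the fast, temporary filesystem (hypothetical path)
    names       -- file names to create and link (assumed names, adjust to your pipeline)
    """
    store = Path(store_dir)
    scratch = Path(scratch_dir)
    store.mkdir(parents=True, exist_ok=True)
    scratch.mkdir(parents=True, exist_ok=True)

    links = []
    for name in names:
        target = store / name
        target.touch(exist_ok=True)  # the real file lives on the long-term store
        link = scratch / name
        if link.is_symlink() or link.exists():
            link.unlink()  # replace a stale link or file left by an earlier run
        link.symlink_to(target.resolve())
        links.append(link)
    return links
```

Anything the pipeline writes to the linked config or log in the scratch directory then lands on the long-term store, so that directory can be put under git (`git init`, `git add`, `git commit`) and survives when the scratch space is cleaned.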