Question

Automating Your Analyses

14.5 years ago

Markf ▴ 290

I was wondering what people in general use to automate analysis pipelines (if anything).

I know that there are many tools around, such as Taverna and Galaxy, but I'm under the impression that many people mostly use the command line and build customized scripts. (I certainly do).

So, what do people use to automate analyses, particularly ones that need to process a lot of data or have to be executed many times over? And why are you using this solution?

<edit> I know that this is almost a duplicate of this question: but my question is broader. Apart from organizing pipelines of small scripts, what do you do in larger, less research-like, more production-like situations? </edit>

pipeline • 3.1k views

Updated 14.5 years ago by Will 4.6k • written 14.5 years ago by Markf ▴ 290

I think Makefiles are just as suited to research as to production environments. Originally they were meant for production (translating source code).

Reply • 14.5 years ago by Michael Kuhn 5.0k

What about Cyrille2? Are you still using/developing it?

Reply • 14.5 years ago by Egon Willighagen 5.4k

Hey Egon. I stopped development on Cyrille2. It is nice for high-throughput pipelines that do not change too often, but too unwieldy for my current work. I use this now: http://mfiers.github.com/Moa/ (again, my own software)

Reply • 14.5 years ago by Markf ▴ 290

Ah, and in Wageningen, the last instance of Cyrille2 is about to get scrapped...

Reply • 13.2 years ago by Jan Van Haarst ▴ 300

Ram · Answer 1 · 2010-06-14

I would answer you in the same way I answered the other question: my pipelines are defined in a lot of Makefiles.

Bear in mind that 90% of the time I don't write very well-defined Makefiles; I only use them to define the order in which the rules and scripts should be run.

An example Makefile may be:

PARAMETERS_FILE=parameters/general.txt

help:
    @echo pipeline for the xyz project. 
    @echo  use plots_general to create a general representation of the results
    @echo  use resume_general to generate a csv report of the results
    @echo  customize $(PARAMETERS_FILE) to define the details of the analysis

whole_analysis: plots_general resume_general

plots_general: tables
    Rscript src/scripts/generate_plots_general.R

input_data: $(PARAMETERS_FILE)
    python src/scripts/get_input_data.py --parameters $< --output data/inputdata.txt

tables: input_data
    python src/scripts/generate_table.py --output results/table.txt

resume_general: tables
    python src/scripts/generate_results.py --input results/table.txt --output results/resume_general.csv

I am not inspired enough to write a nicer example right now, but you can see that I mostly use phony rules (without even declaring them as .PHONY). This way I lose one of the advantages of Makefiles, namely that a rule is not executed when its output is already up to date, but in exchange the Makefile is easier to read.

If I need to use a rule frequently and I need to define it better, I would write something like this:

input_data: data/inputdata.txt
data/inputdata.txt: $(PARAMETERS_FILE)
    python src/scripts/get_input_data.py --parameters $< --output data/inputdata.txt

tables: results/table.txt
results/table.txt: data/inputdata.txt
    python src/scripts/generate_table.py --output results/table.txt
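
The up-to-date check that make applies to file-based rules like the ones above can be sketched in Python. This is only an illustration of the timestamp comparison, not make's implementation: the function name needs_rebuild is hypothetical, and the sketch assumes every prerequisite already exists on disk.

```python
import os


def needs_rebuild(target, prerequisites):
    """Return True if `target` is missing or older than any prerequisite.

    Assumption: all prerequisites exist; make itself would try to
    build a missing prerequisite first.
    """
    if not os.path.exists(target):
        return True
    target_mtime = os.path.getmtime(target)
    # Rebuild when at least one prerequisite was modified after the target.
    return any(os.path.getmtime(p) > target_mtime for p in prerequisites)
```

For the rule above, needs_rebuild("results/table.txt", ["data/inputdata.txt"]) would be True only when the table is missing or the input data is newer than it. A phony rule such as tables has no file to compare against, which is why it always runs.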

Ram · Answer 2 · 2010-06-14

I'm a big fan of Python and Paver. Paver facilitates the creation of REPEATABLE, DEFINABLE and RE-ENTRANT scripts. You can define each "model" of analysis as a separate function: normalization, analysis, figure generation, etc. You can also construct task-dependency trees: analysis depends on normalization, the second step of the analysis depends on the first step, etc. The code will traverse your dependency tree, execute the tasks in that order, and make sure it doesn't do a task twice.
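
The idea of a dependency tree in which each task runs once can be sketched in plain Python. This is not Paver's API, only an illustration of the traversal; the function run_with_dependencies and the task names are hypothetical, and the sketch does not detect circular dependencies.

```python
def run_with_dependencies(name, tasks, needs, done=None):
    """Run `name` after everything it needs, executing each task at most once."""
    if done is None:
        done = []
    if name in done:
        # Already executed earlier in this traversal, so skip it.
        return done
    for dependency in needs.get(name, []):
        run_with_dependencies(dependency, tasks, needs, done)
    tasks[name]()
    done.append(name)
    return done


# Hypothetical tasks standing in for real analysis steps.
log = []
tasks = {
    "normalization": lambda: log.append("normalization"),
    "analysis": lambda: log.append("analysis"),
    "figures": lambda: log.append("figures"),
}
# "figures" lists "normalization" twice over (directly and via "analysis").
needs = {"analysis": ["normalization"], "figures": ["analysis", "normalization"]}

run_with_dependencies("figures", tasks, needs)
print(log)
```

Although normalization is reachable by two paths, it is appended to log only once, before analysis and figures.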

Even though my analysis code is in Python, I prefer to make sh calls and then use text files to pass info between different processes. This also facilitates re-entrant techniques, i.e. if the result text file is already present, then skip the rest of the code ... I then use a -f flag to force re-analysis if I have code changes.
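
A minimal sketch of such a re-entrant step, assuming the step is a shell command that writes a single output file. The names run_step and parse_args are hypothetical and are not taken from the linked pavement.py.

```python
import argparse
import os
import subprocess


def run_step(command, output_path, force=False):
    """Run a shell command unless its output file already exists.

    Returns True if the command ran, False if the step was skipped.
    """
    if os.path.exists(output_path) and not force:
        return False
    # check=True raises CalledProcessError if the command fails.
    subprocess.run(command, shell=True, check=True)
    return True


def parse_args(argv=None):
    """Parse the command line; -f forces every step to run again."""
    parser = argparse.ArgumentParser(description="Re-entrant pipeline step")
    parser.add_argument("-f", "--force", action="store_true",
                        help="re-run even if the output file exists")
    return parser.parse_args(argv)
```

A script would call args = parse_args() and then pass force=args.force to each run_step call. One caveat of this simple form: a command that fails halfway may leave a partial output file behind, so writing to a temporary name and renaming it on success is a safer variant.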

Here is a link to an example of one of my paver-files: http://github.com/JudoWill/flELM/blob/master/pavement.py

If you've got a lot of time, I also like to use the Galaxy framework. It is a wonderful framework for integrating your own analysis pipelines. I have a local fork of their Bitbucket repository. With a little work you can integrate any pipeline into this framework, and then you even have a nice GUI to interact with. Once you get the hang of their code base you can integrate one of your tools into the framework in an afternoon.

My general workflow is to build a tool using the Paver framework, since it's much easier to debug. Once it's finished I'll integrate it into the Galaxy framework.

This sort of reproducibility has made my professional life soooo much easier. When researchers come to me with new projects I can return preliminary answers to them within a day based on tools I already have. Then based on their feedback I'll develop a custom set of tools for their project.

Hope that general rambling helps,

Will