After reading a review of bioinformatics pipeline frameworks, I've started testing a few to see if I could create a simple ChIP-seq peak-calling pipeline. However, I always seem to run into the same problem: what format should I put my metadata in, and how can I specify sample comparisons (for example, ChIP-factor against naked DNA) such that the pipeline is flexible to different numbers of samples and comparisons?
With Snakemake you can provide a .yaml or .json file describing the experiment and the comparisons, but in a large study doing this by hand is quite tedious (I wonder if there is a nice way to parse a .tsv sample design file into a .yaml or .json configuration?). Alternatively, if I were to use GNU Make, all of the comparisons would have to be either specified manually in the Makefile or handled by a single rule that calls a script which runs the comparisons by parsing a sample design file. In the latter case I can't take advantage of Make's parallel jobs feature and would have to rely on the parallel-processing modules available in the scripting language.
I feel that whatever pipeline I wrote with these frameworks would continually have to be changed or updated to accommodate new numbers of samples or comparisons, when really all I want is to be able to provide a new sample design file and have the pipeline adapt to the new design without too much extra tweaking. Maybe I'm expecting too much, but is there a particular representation of metadata you've found works well, or could you recommend a different framework that handles metadata and comparisons directly?
In the case of Snakemake, you can get pretty far by parsing a TSV file and using conditionals in the input/output/params sections. I've yet to see a perfect solution for metadata handling, though.
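To sketch what that looks like in plain Python: build a lookup from the design file, then a Snakemake input function (a `lambda wildcards: ...`) can use it to pick the right control for each ChIP sample. The column names `sample` and `control` are assumptions about the design file, not a fixed convention.

```python
import csv
import io

# A hypothetical sample design file; in practice this would be read
# from samples.tsv on disk rather than an in-memory string.
design = io.StringIO(
    "sample\tcontrol\n"
    "chip_factor_rep1\tnaked_dna_rep1\n"
    "chip_factor_rep2\tnaked_dna_rep2\n"
)

# Map each ChIP sample to its control; this is the kind of lookup a
# Snakemake input function can consult inside a rule.
control_of = {row["sample"]: row["control"]
              for row in csv.DictReader(design, delimiter="\t")}

print(control_of["chip_factor_rep1"])  # naked_dna_rep1
```

Because the dict is rebuilt from the TSV on every run, adding samples to the design file adds comparisons without touching the rules themselves.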
After a bit more testing I agree that parsing a TSV file is probably the best way to handle metadata until a framework with integrated support comes along. For now I'll use a script I wrote to convert the sample sheet into the YAML format that Snakemake understands.
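Snakemake also accepts JSON configs, so a stdlib-only sketch of that conversion might look like the following (the sample sheet columns here are made up for illustration; a real sheet would have whatever fields your design needs):

```python
import csv
import io
import json

# Hypothetical sample sheet; in practice this would be read from a file.
sheet = io.StringIO(
    "sample\tcondition\tcontrol\n"
    "chip_rep1\tchip_factor\tinput_rep1\n"
    "chip_rep2\tchip_factor\tinput_rep2\n"
)

# Collect per-sample metadata and one comparison entry per row.
config = {"samples": {}, "comparisons": []}
for row in csv.DictReader(sheet, delimiter="\t"):
    config["samples"][row["sample"]] = {"condition": row["condition"]}
    config["comparisons"].append(
        {"chip": row["sample"], "control": row["control"]}
    )

# Dump a config that Snakemake can load with --configfile config.json
print(json.dumps(config, indent=2))
```

A YAML version is the same idea with `yaml.safe_dump` from PyYAML in place of `json.dumps`.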
I'm not sure I understand exactly what you mean with metadata. Could you clarify that a little? Do you mean a description of the outputs you want the workflow to produce?
Part of your description sounds like something I've been ranting about before: the need for dynamic workflow scheduling, meaning you can do computations where the number of tasks (comparisons, in this case) is determined from the output of an earlier task in the workflow. That is typically possible with data-flow based systems, so I would expect it to be possible in Nextflow, and also in my own experimental library scipipe, where the concrete need for dynamic scheduling was the whole motivation for writing it.
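As a toy illustration of the idea in plain Python (not scipipe or Nextflow code): an upstream stage decides at runtime how many comparisons exist, and a pool schedules one task per comparison it found.

```python
from concurrent.futures import ThreadPoolExecutor

def discover_comparisons():
    """Stand-in for an upstream task (e.g. parsing a design file it
    produced); the number of downstream tasks is unknown until it runs."""
    return [("chip_rep1", "input_rep1"), ("chip_rep2", "input_rep2")]

def run_comparison(pair):
    """Stand-in for one peak-calling comparison."""
    chip, control = pair
    return f"called peaks for {chip} vs {control}"

# Downstream stage: one task per comparison discovered at runtime,
# executed in parallel by the pool.
comparisons = discover_comparisons()
with ThreadPoolExecutor() as pool:
    results = list(pool.map(run_comparison, comparisons))

print(len(results))  # 2
```

The point is only that the task graph is shaped by data produced during the run, which is exactly what static Makefile-style dependency lists struggle with.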
In this case, I would interpret metadata to be things like groups and control samples that are sometimes, but not always, used for normalizations. The classic example would be ChIP-seq, where different ChIPs will have different controls. In an ideal world these would follow some consistent naming scheme, but on really large projects that may not be a reasonable assumption.
A database with a good API would be a good solution.
Hi James,
Sorry to be late to this discussion, but I couldn't find any other way to contact you.
I am not quite clear on one thing: by "number of samples" do you mean the number of input files to be merged for one sample? Because for ChIP-factor vs. input, you're not talking about multiple actual samples, right? Or are you talking about something like ChIP differential peak analysis, which uses multiple samples? Anyway, I'm not quite clear on the question, but I think you could do this pretty easily with a new system I'm working on called looper. For simple comparisons, you would define a TSV with one line per comparison and then write a pipeline that runs on the TSV. We use a merge table you define (another TSV) to allow you to specify any number of inputs in either category.
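To make that concrete, here is a rough illustration of what the two tables could look like; this is just a sketch of the idea, not looper's exact format.

```
# comparisons.tsv -- one comparison per line
chip_sample	control_sample
chip_factor_A	naked_dna
chip_factor_B	naked_dna

# merge_table.tsv -- any number of input files per sample
sample	file
chip_factor_A	run1_lane1.fastq.gz
chip_factor_A	run1_lane2.fastq.gz
naked_dna	input_run1.fastq.gz
```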
It's still early days and I've been trying to think of ideas for how to do something like this better. So since you've been thinking about it, if this doesn't solve your issues, I'm all ears.