Here is a good overview of using makefiles for bioinformatics analyses. A makefile is a way to write out a DAG of targets and dependencies.
Writing a DAG in JSON is certainly doable (see the convoluted example below), but for lack of anything prepackaged it has seemed like a lot more work to me: I had to design my structure of ordered targets, dependencies and parameters as a nested set of JSON objects and lists, and I had to write all the external code that validates the graph structure and its components and turns dependencies into target end products.
JSON is more readable, but it is also more verbose ("heavyweight") as a consequence. Making sure all the required pieces are present in the pipeline document is a strong prerequisite for processing it. If you want to go this route, you might consider looking into JSON Schema to design a "meta"-language or schema for your graph, which can be used to help ensure individual instances of a pipeline are correct before processing. You might write a schema, and then write a JSON-formatted pipeline that validates against your schema.
Here is one very rough example of such a schema document, which defines inputs (sets of genomic intervals, essentially), operations applied on those sets to create outputs, and a vocabulary of properties and parameters that might be useful for staging and processing (datetime stamp, ID fields, descriptive metadata, etc.):
The following is an example of a JSON-formatted instance of a processing pipeline, which would validate against this schema. The goal is to show a graph that takes transcription start sites, filters them for those associated with the CTCF factor, and then applies the equivalent of a bedmap operation against the result and a list of promoter windows of interest on chromosome 16:
It would be the job of whatever service parses this JSON request or payload to decide which genomic sets are inputs that exist (dependencies) and which are targets, yet to be made, which require backend processing steps.
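As a rough illustration of that parsing step, here is a minimal Python sketch. The `pipeline` layout (a `sets` mapping in which each set optionally lists the sets it is derived `from`) and the `plan` helper are made up for this example and are not the schema described above. It assumes Python 3.9 or later for the standard-library `graphlib` module.

```python
from graphlib import TopologicalSorter, CycleError

# Hypothetical, simplified pipeline: each set names the sets it is derived from.
# Sets with no "from" entry are treated as inputs that already exist.
pipeline = {
    "sets": {
        "TSS": {},
        "promoters": {},
        "ctcf_tss": {"from": ["TSS"]},
        "answer": {"from": ["promoters", "ctcf_tss"]},
    }
}

def plan(pipeline):
    """Split sets into existing inputs and targets, and order the targets."""
    sets = pipeline["sets"]
    graph = {name: spec.get("from", []) for name, spec in sets.items()}
    # Reject references to sets that are never defined.
    for name, deps in graph.items():
        for dep in deps:
            if dep not in graph:
                raise ValueError(f"{name} depends on undefined set {dep}")
    inputs = sorted(name for name, deps in graph.items() if not deps)
    try:
        # static_order() yields every node after all of its dependencies.
        order = list(TopologicalSorter(graph).static_order())
    except CycleError as err:
        raise ValueError(f"pipeline is not a DAG: {err.args[1]}")
    targets = [name for name in order if graph[name]]
    return inputs, targets

inputs, targets = plan(pipeline)
print("inputs:", inputs)
print("build order:", targets)
```

A real service would still need to map each target to the backend command that produces it; this sketch only covers the "which sets exist, which must be made, and in what order" part.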
There are various libraries written to process JSON and validate JSON against a JSON Schema document. In Python, for instance:
$ python
Python 2.7.6 (default, Jul 9 2014, 20:49:24)
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.38)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import json
>>> from jsonschema import validate
>>> schema_fh = open("BEDOPSWebRequestSchema.json", "r")
>>> schema = json.load(schema_fh)
>>> test_request_fh = open("SampleWebRequestPayload.json", "r")
>>> test_request = json.load(test_request_fh)
>>> validate(test_request, schema)
If the request doesn't validate, a ValidationError exception is raised, with an error message that points to the offending object in the request. If the request validates, that doesn't mean there couldn't be problems with the schema, but it's a good start for testing and validation.
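To make the failure case concrete, here is a small sketch using the same `jsonschema` library. The toy schema and the `check` helper are invented for illustration and are much simpler than the request schema discussed above.

```python
from jsonschema import validate, ValidationError

# Toy schema, invented for illustration (not the BEDOPS request schema):
# a request must carry an "id" string and a non-empty list of "sets".
schema = {
    "type": "object",
    "required": ["id", "sets"],
    "properties": {
        "id": {"type": "string"},
        "sets": {"type": "array", "minItems": 1},
    },
}

def check(request):
    """Return None if the request validates, else a short error description."""
    try:
        validate(request, schema)
    except ValidationError as err:
        # err.path locates the offending element inside the request.
        return "{} (at {})".format(err.message, list(err.path))
    return None

print(check({"id": "req-1", "sets": ["TSS"]}))  # valid request
print(check({"id": 42, "sets": ["TSS"]}))       # "id" has the wrong type
```

Catching the exception and reporting `err.path` is one way a service could tell the sender which part of the payload to fix.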
Maybe there is already a suite of tools that does all of this, but I wasn't able to find one. Hopefully someone more knowledgeable will comment, or hopefully this post gives some ideas of what could potentially be done.
GNU Make does a lot of the heavy lifting, and the tools to process a makefile are ubiquitous on UNIX-like systems such as Linux and OS X, so translating this system to another language is perhaps reinventing the wheel. I mean, all that JSON above can basically be reduced to something like:
$ grep 'CTCF' TSS.bed | bedmap --chrom chr16 promoters.bed - > answer.bed
A makefile for this is neither long nor hard to read:
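# Default goal: build the intermediate file and the final answer
all: ctcf_tss.bed answer.bed
# CTCF-associated start sites; remade whenever TSS.bed changes
ctcf_tss.bed: TSS.bed
	grep 'CTCF' $< > $@
# $< expands to the first prerequisite (ctcf_tss.bed), $@ to the target
answer.bed: ctcf_tss.bed promoters.bed
	bedmap --chrom chr16 promoters.bed $< > $@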
Further, if any dependency changes (say, the set of transcription start sites changes) then only targets downstream of changed dependencies get remade, which is more efficient. This could be done with a JSON-based approach, but it requires coding.
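To give a sense of the coding involved, here is a minimal Python sketch of the timestamp comparison that make performs for a single target. The `is_stale` helper is hypothetical, and a full JSON-based runner would also have to walk the graph so that rebuilding one target marks everything downstream of it for rebuilding.

```python
import os
import tempfile

def is_stale(target, dependencies):
    """Return True if target is missing or older than any dependency."""
    if not os.path.exists(target):
        return True
    target_mtime = os.path.getmtime(target)
    return any(os.path.getmtime(dep) > target_mtime for dep in dependencies)

# Demonstration in a scratch directory, with modification times set by hand.
with tempfile.TemporaryDirectory() as workdir:
    tss = os.path.join(workdir, "TSS.bed")
    ctcf = os.path.join(workdir, "ctcf_tss.bed")
    open(tss, "w").close()
    print(is_stale(ctcf, [tss]))   # True: target does not exist yet
    open(ctcf, "w").close()
    os.utime(tss, (1000, 1000))    # dependency older than target
    os.utime(ctcf, (2000, 2000))
    print(is_stale(ctcf, [tss]))   # False: target is up to date
    os.utime(tss, (3000, 3000))    # dependency modified after the build
    print(is_stale(ctcf, [tss]))   # True: target must be remade
```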
A well-written Makefile should be (mostly) self-documenting to a bioinformaticist.
Can you point to an example of a makefile that you think is good? I think of them as being for building software, not for running analyses.
I've just written a simple one: https://github.com/lindenb/ngsxml