Standard simple format to describe a bioinformatics analysis pipeline

10.4 years ago

Laura ★ 1.8k

I want to get a collection of different groups to describe their analysis pipelines in a standard way, so that it is easier to see where people are doing the same thing and where they are doing different things for the same sort of analysis.

I think the attributes this file would need for each step in a pipeline are:

inputs, outputs, program, version, command line.

It would be good to have something which also states the order of the steps.

I know the SRA/ENA analysis XML allows for at least some of this, but it is quite heavyweight, so I am hoping for either a custom format or something using JSON syntax, so that it is both human-readable and allows some programmatic parsing.
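As a rough illustration of the kind of record described above, here is a minimal Python sketch that writes a pipeline description as JSON and reads it back. The field names simply mirror the attributes listed in the question; the pipeline name, tool versions and file names are made up for the example and are not part of any existing standard.

```python
import json

# Hypothetical pipeline description. Field names follow the attributes
# listed in the question; tools, versions and file names are illustrative.
pipeline = {
    "name": "example_alignment",
    "steps": [
        {
            "order": 2,
            "program": "samtools",
            "version": "1.1",
            "inputs": ["aligned.sam"],
            "outputs": ["aligned.bam"],
            "command_line": "samtools view -b aligned.sam > aligned.bam",
        },
        {
            "order": 1,
            "program": "bwa",
            "version": "0.7.10",
            "inputs": ["reference.fa", "reads.fastq"],
            "outputs": ["aligned.sam"],
            "command_line": "bwa mem reference.fa reads.fastq > aligned.sam",
        },
    ],
}

REQUIRED = {"order", "program", "version", "inputs", "outputs", "command_line"}


def missing_fields(step):
    """Return the required attribute names that a step description lacks."""
    return sorted(REQUIRED - set(step))


# Human-readable form that could be committed next to a README.
text = json.dumps(pipeline, indent=2)

# Programmatic parsing: recover the steps in their stated order.
steps = sorted(json.loads(text)["steps"], key=lambda s: s["order"])
for step in steps:
    print(step["order"], step["program"], step["version"], missing_fields(step))
```

Keeping an explicit `order` field is only one option; the order could also be derived from which step's outputs feed which step's inputs.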

Before I specify something myself, is there a solution already which provides most, if not all, of the functionality I want?

ChIP-Seq alignment RNA-Seq • 4.0k views

ADD COMMENT • link updated 3.1 years ago by Ram 45k • written 10.4 years ago by Laura ★ 1.8k

A well-written Makefile should be (mostly) self-documenting to a bioinformaticist.

ADD REPLY • link 10.3 years ago by Alex Reynolds 36k

Can you point to an example of a Makefile that you think is good? I think of them as being for building software, not for running analyses.

ADD REPLY • link 10.3 years ago by Laura ★ 1.8k

I've just written a simple one: https://github.com/lindenb/ngsxml

ADD REPLY • link 10.3 years ago by Pierre Lindenbaum 165k

10.4 years ago

Pierre Lindenbaum 165k

Hmm... I'm not sure I understand. Something like a Makefile contains all the recipes and the command lines, but it can be hard to read... One could imagine building a Makefile-based workflow from an XML descriptor plus an XSLT stylesheet. See my **old** example: http://plindenbaum.blogspot.fr/2012/08/the-500th-post-generating-pipeline-of.html

The very same XML descriptor could be used to write LaTeX/Markdown documentation about the workflow.

UPDATE: I wrote this https://github.com/lindenb/ngsxml

2nd idea: a Galaxy pipeline can be exported as a file; see https://wiki.galaxyproject.org/ToolShedWorkflowSharing. As far as I remember, the output format is JSON.

ADD COMMENT • link 10.3 years ago by Pierre Lindenbaum 165k

This isn't necessarily meant to be something someone could use to run the pipeline, but rather a description of the command lines used. If someone else wanted to install all the tools and rerun the process using their own pipeline infrastructure, they could; it would also let you compare how the same steps differ between pipelines. I hope something like this will help me improve the README files for our analyses too.

I tried to find an example of the Galaxy JSON you mention, but I couldn't come up with the right search terms. Do you have an example?

ADD REPLY • link 10.4 years ago by Laura ★ 1.8k

Moved my comment to an answer.

ADD REPLY • link 10.4 years ago by Pierre Lindenbaum 165k

10.3 years ago

Alex Reynolds 36k

Here is a good overview of using makefiles for bioinformatics analyses. A makefile is a way to write out a DAG of targets and dependencies.

Writing a DAG in JSON is certainly doable (see the convoluted example below), but I have not found anything prepackaged, so it has seemed like a lot more work to me: I had to design my structure of ordered targets, dependencies and parameters as a nested set of JSON objects and lists, and I had to write all the external code to validate the graph structure and components, as well as to process dependencies into target end products.
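To give a feel for the graph-validation part of that external code, here is a minimal Python sketch (not the code I actually wrote) that checks whether a set of targets and dependencies forms a DAG and, if so, returns a valid build order. It assumes Python 3.9 or later for the standard-library `graphlib` module; the `build_order` helper and the file names are illustrative.

```python
from graphlib import CycleError, TopologicalSorter


def build_order(dependencies):
    """Return all nodes in an order that respects the dependencies.

    `dependencies` maps each target to the set of things it needs.
    Raises ValueError if the graph contains a cycle, i.e. is not a DAG.
    """
    try:
        return list(TopologicalSorter(dependencies).static_order())
    except CycleError as err:
        # graphlib stores the detected cycle as the second element of args.
        raise ValueError(f"not a DAG, cycle found: {err.args[1]}") from err


# Illustrative targets and dependencies.
graph = {
    "ctcf_tss.bed": {"TSS.bed"},
    "answer.bed": {"ctcf_tss.bed", "promoters.bed"},
}
order = build_order(graph)
print(order)
```

Anything that appears only as a dependency (here `TSS.bed` and `promoters.bed`) comes out before the targets that need it, which is the ordering information a Makefile gives you for free.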

JSON is more readable, but it is also more verbose ("heavyweight") as a consequence. Making sure all the bits and pieces are present in the pipeline document seems a strong prerequisite. If you want to go this route, you might consider looking into JSON Schema to design a "meta"-language or schema for your graph, which can be used to help ensure that individual instances of a pipeline are correct before processing. You would write a schema, and then write a JSON-formatted pipeline that validates against your schema.

Here is one very rough example of such a schema document, which defines inputs (sets of genomic intervals, essentially), operations applied on those sets to create outputs, and a vocabulary of properties and parameters that might be useful for staging and processing (datetime stamp, ID fields, descriptive metadata, etc.):


	{
	  "$schema": "http://json-schema.org/draft-04/schema#",
	  "description": "A schema to describe a web request graph for BEDOPS operations, building a tree of individual operations, inputs and outputs",
	  "properties": {
	    "dtsubmission": {
	      "description": "An RFC3339-formatted request graph submission timestamp in UTC time",
	      "format": "date-time",
	      "type": "string"
	    },
	    "id": {
	      "description": "A unique identifier for a request graph instance",
	      "type": "string"
	    },
	    "name": {
	      "description": "Name of the operation request graph instance",
	      "type": "string"
	    },
	    "operations": {
	      "description": "List of request graph Operation components",
	      "items": {
	        "properties": {
	          "id": {
	            "description": "A unique identifier for this Operation",
	            "type": "string"
	          },
	          "name": {
	            "description": "Name of this Operation",
	            "type": "string"
	          },
	          "parameters": {
	            "description": "List of Operation parameters",
	            "items": {
	              "properties": {
	                "arguments": {
	                  "description": "Operation parameter arguments",
	                  "items": {
	                    "properties": {
	                      "options": {
	                        "items": {
	                          "properties": {
	                            "kind": {
	                              "description": "Operation option kind",
	                              "enum": [
	                                "range_start",
	                                "range_stop",
	                                "filter_score",
	                                "filter_name",
	                                "filter_strand",
	                                "filter_chromosome",
	                                "set_range_left",
	                                "set_range_right"
	                              ],
	                              "type": "string"
	                            },
	                            "value": {
	                              "description": "Operation option value",
	                              "type": "string"
	                            }
	                          },
	                          "type": "object"
	                        },
	                        "type": "array"
	                      },
	                      "sets": {
	                        "description": "List of Operation input or output sets (Element Set)",
	                        "items": {
	                          "properties": {
	                            "id": {
	                              "description": "A unique identifier for a defined input Element Set",
	                              "type": "string"
	                            },
	                            "kind": {
	                              "description": "Set input/output kind",
	                              "enum": [
	                                "input",
	                                "input_reference",
	                                "input_map",
	                                "output"
	                              ],
	                              "type": "string"
	                            }
	                          },
	                          "required": [
	                            "id",
	                            "kind"
	                          ],
	                          "type": "object"
	                        },
	                        "type": "array"
	                      }
	                    },
	                    "type": "object"
	                  },
	                  "type": "array"
	                },
	                "kind": {
	                  "description": "Operation kind",
	                  "enum": [
	                    "element_set_range",
	                    "element_set_filter_score",
	                    "element_set_filter_name",
	                    "element_set_filter_strand",
	                    "element_set_filter_chromosome",
	                    "element_set_union",
	                    "element_set_merge",
	                    "element_set_element_of",
	                    "element_set_not_element_of",
	                    "element_set_component",
	                    "element_set_difference",
	                    "element_set_symmetric_difference",
	                    "element_set_partition",
	                    "element_set_map_on_element_set",
	                    "element_set_attributes"
	                  ],
	                  "type": "string"
	                }
	              },
	              "required": [
	                "kind",
	                "arguments"
	              ],
	              "type": "object"
	            },
	            "type": "array"
	          },
	          "summary": {
	            "description": "Optional details about this Operation instance",
	            "type": "string"
	          }
	        },
	        "required": [
	          "id",
	          "name"
	        ],
	        "type": "object"
	      },
	      "type": "array"
	    },
	    "sets": {
	      "description": "List of request graph Element Set components",
	      "items": {
	        "properties": {
	          "id": {
	            "description": "A unique identifier for this Element Set",
	            "type": "string"
	          },
	          "kind": {
	            "description": "Set kind",
	            "enum": [
	              "element",
	              "interaction"
	            ],
	            "type": "string"
	          }
	        },
	        "required": [
	          "id",
	          "kind"
	        ],
	        "type": "object"
	      },
	      "type": "array"
	    }
	  },
	  "required": [
	    "id",
	    "name",
	    "dtsubmission"
	  ],
	  "title": "operation request graph schema",
	  "type": "object"
	}

The following is an example of a JSON-formatted instance of a processing pipeline, which would validate against this schema. The goal is to show a graph that takes transcription start sites (TSSs), filters them for those associated with the factor CTCF, and then applies the equivalent of a bedmap operation to them and a list of promoter windows of interest on chromosome 16:


	{
	  "dtsubmission": "2014-08-13T00:21:17Z",
	  "id": "abcd1234",
	  "name": "Test request graph for finding CTCF TSSs that associate with promoters of interest on chr16",
	  "operations": [
	    {
	      "id": "map_ctcf_TSSs_to_promoters",
	      "name": "Map CTCF TSSs to promoters on chr16",
	      "parameters": [
	        {
	          "arguments": [
	            {
	              "options": [
	                {
	                  "kind": "filter_chromosome",
	                  "value": "chr16"
	                }
	              ],
	              "sets": [
	                {
	                  "id": "0001_ctcf_TSSs",
	                  "kind": "input_map"
	                },
	                {
	                  "id": "0002_all_promoters",
	                  "kind": "input_reference"
	                },
	                {
	                  "id": "0003_promoters_which_overlap_ctcf_TSSs",
	                  "kind": "output"
	                }
	              ]
	            }
	          ],
	          "kind": "element_set_map_on_element_set"
	        }
	      ],
	      "summary": "Build a list of promoters which overlap CTCF TSSs"
	    },
	    {
	      "id": "element_filter_TSSs",
	      "name": "Filter TSSs by CTCF",
	      "parameters": [
	        {
	          "arguments": [
	            {
	              "options": [
	                {
	                  "kind": "filter_name",
	                  "value": "CTCF"
	                }
	              ],
	              "sets": [
	                {
	                  "id": "0000_gene_TSSs",
	                  "kind": "input"
	                },
	                {
	                  "id": "0001_ctcf_TSSs",
	                  "kind": "output"
	                }
	              ]
	            }
	          ],
	          "kind": "element_set_filter_name"
	        }
	      ],
	      "summary": "We take the gene TSSs and filter them on the transcription factor of interest"
	    }
	  ],
	  "sets": [
	    {
	      "id": "0000_gene_TSSs",
	      "kind": "element"
	    },
	    {
	      "id": "0001_ctcf_TSSs",
	      "kind": "element"
	    },
	    {
	      "id": "0002_all_promoters",
	      "kind": "element"
	    },
	    {
	      "id": "0003_promoters_which_overlap_ctcf_TSSs",
	      "kind": "element"
	    }
	  ]
	}

It would be the job of whatever service parses this JSON request or payload to decide which genomic sets are inputs that exist (dependencies) and which are targets, yet to be made, which require backend processing steps.
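One way such a service might make that decision is sketched below in Python: any set that some operation lists with kind `output` is treated as a target, and any set that is only ever consumed is treated as an existing dependency. The `classify_sets` and `op` helpers are hypothetical, and the small inline request keeps only the `id` and `kind` fields of the payload above.

```python
def op(op_id, sets):
    """Build one operation in the nested layout used by the payload above."""
    return {
        "id": op_id,
        "parameters": [
            {"arguments": [{"sets": [{"id": i, "kind": k} for i, k in sets]}]}
        ],
    }


# Trimmed-down copy of the example request: set ids and kinds only.
request = {
    "operations": [
        op("map_ctcf_TSSs_to_promoters", [
            ("0001_ctcf_TSSs", "input_map"),
            ("0002_all_promoters", "input_reference"),
            ("0003_promoters_which_overlap_ctcf_TSSs", "output"),
        ]),
        op("element_filter_TSSs", [
            ("0000_gene_TSSs", "input"),
            ("0001_ctcf_TSSs", "output"),
        ]),
    ]
}


def classify_sets(request):
    """Split set ids into (existing dependencies, targets to be built)."""
    produced, consumed = set(), set()
    for operation in request["operations"]:
        for parameter in operation["parameters"]:
            for argument in parameter["arguments"]:
                for s in argument["sets"]:
                    if s["kind"] == "output":
                        produced.add(s["id"])
                    else:
                        consumed.add(s["id"])
    # A set that no operation produces must already exist on the backend.
    return consumed - produced, produced


dependencies, targets = classify_sets(request)
print("must already exist:", sorted(dependencies))
print("to be built:", sorted(targets))
```

Note that `0001_ctcf_TSSs` is both an output of the filter step and an input of the map step, so it ends up as a target; the order in which operations are listed in the payload does not matter for this classification.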

There are various libraries written to process JSON and validate JSON against a JSON Schema document. In Python, for instance:

$ python
Python 2.7.6 (default, Jul  9 2014, 20:49:24) 
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.38)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import json
>>> from jsonschema import validate
>>> schema_fh = open("BEDOPSWebRequestSchema.json", "r")
>>> schema = json.load(schema_fh)
>>> test_request_fh = open("SampleWebRequestPayload.json", "r")
>>> test_request = json.load(test_request_fh)
>>> validate(test_request, schema)

If the request doesn't validate, a ValidationError exception is raised, with details that point to the offending JSON object in the request. If the request validates, that doesn't mean there couldn't be problems with the schema, but it's a good start for testing and validation.

Maybe there is a suite of tools written that do all of this already, but I wasn't able to find one. Hopefully someone more knowledgeable will comment, or hopefully this post gives some ideas of what could potentially be done.

GNU Make seems to do a lot of the heavy lifting, and the tools to process a makefile are ubiquitous on UNIX systems like Linux and OS X, so it is perhaps reinventing the wheel to translate this system to another language. I mean, all that JSON above can basically be reduced to something like:

$ grep 'CTCF' TSS.bed | bedmap --chrom chr16 promoters.bed - > answer.bed

A makefile from this is not very long or complex to read:

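# The final product is the default goal; intermediates are built as needed.
all: answer.bed

# Remade only when TSS.bed changes.
ctcf_tss.bed: TSS.bed
	grep 'CTCF' $< > $@

# Remade when either the promoter windows or the filtered TSSs change.
answer.bed: promoters.bed ctcf_tss.bed
	bedmap --chrom chr16 $^ > $@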

Further, if any dependency changes (say, the set of transcription start sites changes) then only targets downstream of changed dependencies get remade, which is more efficient. This could be done with a JSON-based approach, but it requires coding.
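To give an idea of the coding involved, here is a rough Python sketch of the timestamp check that Make performs: a target needs rebuilding if it is missing, if it is older than any of its dependencies, or if one of its dependencies itself needs rebuilding. The `is_stale` and `stale_targets` helpers and the `rules` mapping are hypothetical, and the sketch assumes the rules are listed in build order.

```python
import os


def is_stale(target, dependencies):
    """Return True if `target` is missing or older than any dependency.

    Dependencies are assumed to exist on disk; a missing one raises
    FileNotFoundError, much as Make stops when it has no rule for a file.
    """
    if not os.path.exists(target):
        return True
    target_mtime = os.path.getmtime(target)
    return any(os.path.getmtime(dep) > target_mtime for dep in dependencies)


def stale_targets(rules):
    """Return the set of targets that need to be remade.

    `rules` maps each target to a list of its dependencies and must be
    given in build order (dependencies before the targets that use them).
    """
    stale = set()
    for target, deps in rules.items():
        # Staleness propagates downstream from any remade dependency.
        if any(dep in stale for dep in deps) or is_stale(target, deps):
            stale.add(target)
    return stale


# Illustrative rules mirroring the makefile above.
rules = {
    "ctcf_tss.bed": ["TSS.bed"],
    "answer.bed": ["promoters.bed", "ctcf_tss.bed"],
}
```

Calling `stale_targets(rules)` in a directory holding these files would report only the targets downstream of whatever changed; actually running the commands, handling failures and cleaning up partial outputs would still need to be written on top of this.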

ADD COMMENT • link updated 3.1 years ago by Ram 45k • written 10.3 years ago by Alex Reynolds 36k