A quick and basic question today. I often see in the literature (particularly in the context of NGS) the words "pipeline" and "workflow" used interchangeably. Is there a real difference between the two?
From IT and C/S usage:
A pipeline is a series of processes, usually linear, which filter or transform data. The processes are generally assumed to be running concurrently. The data flow diagram of a pipeline does not normally branch or loop. The first process takes raw data as input, does something to it, then sends its results to the second process, and so on, eventually ending with the final result being produced by the last process in the pipeline. Pipelines are normally quick, with a flow taking seconds to hours for end-to-end processing of a single set of data.
Examples of pipelines in the real world include chaining two or more processes together on the command line using the '|' (pipe) symbol, with results in stdout or redirected to a file, or a simple software build process driven by 'make'.
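For instance, a minimal sketch of a linear pipeline in the shell (the file name and the exact commands are just a hypothetical illustration) could be:

    # Summarise the most common read lengths in a FASTQ file (hypothetical input file).
    # All five commands run concurrently; data flows left to right through the pipes.
    awk 'NR % 4 == 2 { print length($0) }' reads.fastq \
        | sort -n | uniq -c | sort -rn | head

There is a single entry point, a single exit point, and no branching -- exactly the linear shape described above.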
A workflow is a set of processes, usually non-linear, often human rather than machine, which filter or transform data, often triggering external events. The processes are not assumed to be running concurrently. The data flow diagram of a workflow can branch or loop. There may be no clearly defined "first" process -- data may enter the workflow from multiple sources. Any process may take raw data as input, do something to it, then send its results to another process. There may be no single "final result" from a single process; rather, multiple processes might deliver results to multiple recipients. Workflows can be complex and long-lived; a single flow may take days, months, or even years to execute.
Examples of workflows in the real world include document, bug, or order processing, or iterative processing of very large data sets, particularly if humans are in the loop.
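To sketch the contrast (again with hypothetical file names, not anything prescriptive), even a trivial workflow tends to involve loops, branches, and more than one output rather than a single pipe:

    #!/bin/sh
    # Hypothetical sketch: loop over samples, branch on paired- vs single-end data,
    # and fan the results out to two different lists.
    for r1 in samples/*_R1.fastq; do
        [ -f "$r1" ] || continue
        sample=$(basename "$r1" _R1.fastq)
        if [ -f "samples/${sample}_R2.fastq" ]; then
            echo "$sample" >> paired_samples.txt
        else
            echo "$sample" >> single_samples.txt
        fi
    done
    # A real workflow might pause here for a human to review both lists
    # before anything downstream is launched.

There is no single chain from start to finish, and the "final result" depends on which branch each sample takes.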
These terms have become mixed in recent years, in part because pipelines can be implemented as a very simple subset of workflows. In previous decades, workflow software was large, complex, commercial, and involved high licensing fees, while pipelines were a thing you did on the fly or in a shell script. The terminology has become more blurred as simpler "workflow" software packages have emerged; some of these are really just complicated versions of distributed 'make', and don't support humans in the loop. They really should have been called "data flow" rather than workflow packages. Likewise, there have been more efforts to support branching, looping, and suspended flows in "pipeline" libraries for various languages, and we've seen more pipelines spread over multiple machines, with data transport via HTTP, other TCP protocols, or shared networked filesystems.
A pipeline could just be a bunch of commands embedded in a build script.
When I hear workflow I think exclusively of a heavyweight platform like Taverna that is designed to make it easy for end users to use modular units to construct analyses. Of course, Pipeline Pilot also falls into this category, so it appears I might be the only one who makes this assumption.
http://en.wikipedia.org/wiki/Bioinformatics_workflow_management_systems
I would tend to think there is little difference but I do use these terms in slightly different ways.
I use 'pipeline' to refer to an established (often large) workflow (e.g. the Ensembl pipeline) that may have flow control built-in.
I use the term 'workflow' for a series of computational steps, usually programmed to run in one go, though sometimes the concept alone is enough to refer to it as such.
I suspect that in practice there's not a lot to it, and the difference in usage may have to do with the background of the speaker. For example, in my usage 'workflow' is a more formal, strict, and computational term than 'pipeline'. If I had to justify that, certain (non-bioinformatic) software systems have workflows in which documents and data move automatically from stage to stage, which is not far from Galaxy's series of analysis steps. But they're foggy terms.
1st world problems :D
Looks like the consensus will be: no consensus