Hi,
I am developing some processing pipelines (Quality Control, Mapping, Variant Calling ...) for NGS data.
It is probably a bad habit, but at the beginning you make some choices about your pipeline (tools used, order of tools, home-made code for a particular step, ...) and the pipeline serves you well for quite a while. But then one day a brand new mapper arrives, and you need to remove the old core mapping algorithm and plug in the new one (and you may need to adapt some of the processing around this tool as well). The new pipeline will generate new results with their own specificities, and one has to be able to recover the exact sequence of processing tools that was used. The older pipeline is not going to be deleted and might be used again for some specific project.
There are several levels of change that may need to be tracked. For instance, if we look at a pipeline as a series of processing boxes, the changes could be:
- A tool inside a box changes (`BWA` for `Stampy`, for instance)
- The version of a tool used in a box changes
- Parameters supplied to a tool change (`bwa sampe` for `bwa sampe -s`)
- The order of processing changes (`MarkDup` before `IndelRealigner`, or the opposite)
- Other minor changes in the code that could slightly affect the behavior of the pipeline
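To make these levels concrete, here is a rough Python sketch that models a pipeline as an ordered list of boxes and reports which kind of change separates two definitions. The `Step` class, the `describe_changes` function, the box names and the "x.y" version strings are all hypothetical placeholders, not part of any existing tool.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Step:
    box: str                       # role of the box in the pipeline, e.g. "mapping"
    tool: str                      # tool plugged into that box
    version: str                   # placeholder version string
    params: Tuple[str, ...] = ()   # command-line parameters passed to the tool


def describe_changes(old: List[Step], new: List[Step]) -> List[str]:
    """List the differences between two pipeline definitions, box by box."""
    changes = []
    old_by_box = {s.box: s for s in old}
    new_by_box = {s.box: s for s in new}
    for box in old_by_box:
        if box not in new_by_box:
            changes.append(f"{box}: removed")
    for box, n in new_by_box.items():
        o = old_by_box.get(box)
        if o is None:
            changes.append(f"{box}: added")
        elif o.tool != n.tool:
            changes.append(f"{box}: tool {o.tool} -> {n.tool}")
        elif o.version != n.version:
            changes.append(f"{box}: version {o.version} -> {n.version}")
        elif o.params != n.params:
            changes.append(f"{box}: params {o.params} -> {n.params}")
    # Compare the relative order of the boxes present in both pipelines.
    shared_old = [s.box for s in old if s.box in new_by_box]
    shared_new = [s.box for s in new if s.box in old_by_box]
    if shared_old != shared_new:
        changes.append("order of boxes changed")
    return changes


old_pipeline = [
    Step("mapping", "bwa", "x.y", ("sampe",)),
    Step("markdup", "MarkDup", "x.y"),
    Step("realign", "IndelRealigner", "x.y"),
]
new_pipeline = [
    Step("mapping", "stampy", "x.y"),
    Step("realign", "IndelRealigner", "x.y"),
    Step("markdup", "MarkDup", "x.y"),
]
print(describe_changes(old_pipeline, new_pipeline))
```

The last bullet (minor changes in the surrounding code) is deliberately not covered here, since it cannot be seen from the box description alone.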
Which changes, in your opinion, need to be tracked, and what is your experience/practice on this matter? Is this essentially a manual task, or do tools exist that provide automatic numbering?
How do you organize your code to enable this tracking?
Thanks for your inputs,
T.
I would suggest that you always track all changes in your code. You never know when a bug will be introduced in your program/pipeline (if you knew, you wouldn't have introduced it!). As I see it, there are two kinds of changes in a pipeline: 1) changes in your own code and 2) changes in the code you only call/execute (for example, changing the version of a tool). Changes in your own code can be easily tracked, but changes in external tools have to be tracked explicitly, for example by including the version of the tool in the path of the tool (/path/to/tool-v1.1/tool) or something similar.
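If the version is encoded in the path like this, the pipeline can read it back at run time and log it next to the results. A minimal Python sketch, assuming the install directory is always named `<tool>-v<version>` as in the example path above (the function name `tool_version_from_path` is made up for illustration):

```python
import re
from pathlib import PurePosixPath

# Matches a directory name such as "tool-v1.1" (assumed naming convention).
_VERSIONED_DIR = re.compile(r"^(?P<name>.+)-v(?P<version>\d+(?:\.\d+)*)$")


def tool_version_from_path(path: str):
    """Return (tool name, version) taken from the executable's parent directory."""
    parent = PurePosixPath(path).parent.name
    match = _VERSIONED_DIR.match(parent)
    if match is None:
        raise ValueError(f"no version found in path: {path}")
    return match.group("name"), match.group("version")


print(tool_version_from_path("/path/to/tool-v1.1/tool"))
```

Raising an error for an unversioned path is a deliberate choice in this sketch: a pipeline that silently runs an untracked binary defeats the purpose of the convention.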
One question, so that I can answer you better: have you tried any version control system, like hg, git or cvs?
Yes, I am using SVN. The point here is to store, in the database that holds the pipelines' results, the version of the pipeline used to generate them. This version should point to the exact tools and tool versions that were used. I thought about using the SVN commit revision number, but as all the pipelines live in the same SVN repository, this number grows even when a single comma is modified anywhere, so at first glance it did not seem appropriate to me (but I may be wrong!). So I am a bit lost on that. And as several mapping pipelines should co-exist, I do not even know whether it is better to duplicate the code with another tool in a box and set a version number accordingly, or to manage this inside the module itself with "if ... else" statements. Another possibility would be to say that I have a BWA-Mapping pipeline and a Stampy-Mapping pipeline and to increment only the version numbers of these two pipelines, rather than having a generic "Mapping" pipeline with customizable tools in some boxes. As you can see, I am only starting to think about it, and what I am looking for here is advice on good practice for this kind of thing. I would prefer not to head down the wrong path.
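One possible alternative to the repository-wide SVN number, offered only as a sketch and not as something proposed in this thread, is to derive an identifier from the pipeline definition itself and store that with the results. The identifier then changes only when the tools, their versions, their parameters or their order change. The function name `pipeline_fingerprint`, the dictionary layout and the "x.y" version strings below are hypothetical.

```python
import hashlib
import json


def pipeline_fingerprint(name, steps):
    """Return a short ID that depends only on the pipeline name and its ordered steps."""
    payload = json.dumps(
        {"name": name, "steps": steps},
        sort_keys=True,            # key order inside each step does not matter
        separators=(",", ":"),     # fixed separators keep the encoding stable
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


bwa_mapping = [
    {"tool": "bwa", "version": "x.y", "params": ["sampe"]},
    {"tool": "MarkDup", "version": "x.y", "params": []},
    {"tool": "IndelRealigner", "version": "x.y", "params": []},
]
# Same pipeline, but with "-s" added to the first step's parameters.
bwa_mapping_s = [dict(bwa_mapping[0], params=["sampe", "-s"])] + bwa_mapping[1:]

id_a = pipeline_fingerprint("BWA-Mapping", bwa_mapping)
id_b = pipeline_fingerprint("BWA-Mapping", bwa_mapping_s)
print(id_a, id_b, id_a != id_b)
```

A fingerprint like this does not capture changes in the home-made code around the tools; for that part, the revision of the pipeline's own directory would still have to be recorded alongside it.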