There are a lot of tools out there that are very useful on the command line and very widely used in bioinformatics, but they quickly become annoying (maybe, sometimes) when you are writing a pipeline that cares about I/O connections and each task's exit status.
I am writing a pipeline using samtools, and samtools turns out to be a little annoying in its I/O management: sometimes it generates an output file without you ever explicitly naming that file. Other tools don't even ask for an output file, and some require users to provide literal paths, which adds more workarounds and can be a bit frustrating. Here is an example using samtools. I am wrapping a call to samtools on a file that does not exist. The command fails, but the exit code is still zero, which is misleading if we care about reporting the status of the entire pipeline: a zero exit code says that the sort went OK, and that will trigger the downstream tasks, which is wrong.
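For illustration, here is a minimal sketch of the kind of wrapper I mean. The names `run_step`, `StepFailed` and the `fail_on_stderr` flag are made up for this example and are not part of samtools or any library. Since the exit code alone cannot be trusted here, the wrapper can optionally treat any stderr output as a failure. That is a crude heuristic, because many tools write progress messages to stderr, so it is off by default.

```python
import subprocess


class StepFailed(RuntimeError):
    """Raised when a pipeline step looks like it failed (hypothetical name)."""


def run_step(cmd, fail_on_stderr=False):
    """Run cmd (a list of arguments) and return the CompletedProcess.

    Raises StepFailed on a nonzero exit code. If fail_on_stderr is True,
    also raises when the command wrote anything to stderr, which is a
    workaround for tools that exit with 0 despite an error.
    """
    proc = subprocess.run(cmd, capture_output=True, text=True)
    if proc.returncode != 0:
        raise StepFailed(
            "%r exited with code %d: %s" % (cmd, proc.returncode, proc.stderr.strip())
        )
    if fail_on_stderr and proc.stderr.strip():
        raise StepFailed(
            "%r exited with 0 but wrote to stderr: %s" % (cmd, proc.stderr.strip())
        )
    return proc
```

A call such as `run_step(["samtools", "sort", "missing.bam", "sorted"], fail_on_stderr=True)` would then stop the pipeline if samtools printed an error, whatever exit code it returned (the file names are placeholders).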
I am cross-posting to CodersCrowd as well, with code you can run in the browser: http://coderscrowd.com/app/codes/view/288. You can see that both the status coming back from the Docker image and the one coming back from the Python interpreter itself (which is basically the samtools exit status) are zero, and they shouldn't be.
In the particular case of samtools, I would think that if you're using python anyway then it might be more convenient to just use pysam. Then the errors can be more easily caught in the base python script (at least in theory).
I agree that predictable exit codes are convenient, but there are so many other ways to determine whether or not it's safe to proceed in your pipeline for a given step. In this case, since you're using samtools sort, why not just check the size of the expected output file? You can set the name of the sorted bam to be whatever you want (samtools automatically appends .bam to whatever name you choose).
That's possible when we know the expected output file, but sometimes it cannot be predicted, for example when that branch of the pipeline is itself subject to a condition.
Not being able to predict a filename is a problem with the coding/construction of the pipeline, not something truly intractable.
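A minimal sketch of the output-file check suggested above, assuming the pipeline itself chooses the output name. The function name `output_looks_valid` and the `min_bytes` threshold are hypothetical. Note that a size check only shows that something was written, not that the BAM is complete or well-formed.

```python
import os


def output_looks_valid(path, min_bytes=1):
    """Return True if path is a regular file of at least min_bytes bytes."""
    return os.path.isfile(path) and os.path.getsize(path) >= min_bytes
```

With an output prefix of `sorted` (a placeholder), the pipeline would call `output_looks_valid("sorted.bam")` after the sort step and only trigger the next task when it returns True.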
This is a very common situation, one that does not really have a solution other than working around the problems in various inconvenient ways. As for the causes: there is very little incentive and few rewards for writing code that behaves as it should.
I agree with you, Istvan. This is basically what we have to do on a daily basis (working around the problems), but these are the basics of software engineering (being able to distinguish stderr from stdout, managing exit codes, etc.), and for reproducibility's sake one should be able to just look at the log to get a sense of what's going on in the pipeline. The incentive for returning the right exit code from a program is simply doing things the right way, or I should say as right as possible :)