Tutorial: Minimum Standards For Bioinformatics Command Line Tools
11.2 years ago
Medhat 9.8k

Hi,

I found this article, and I think it is a nice guide for developing command-line tools.

Source: Minimum standards for bioinformatics command line tools

  • Print something if no parameters are supplied

Unless your tool is a filter which works by manipulating stdin to stdout, you should always print out something (some help text, ideally) if the user runs your tool without all the required parameters. Just exiting quietly isn't helping anyone.

% biotool
Please use the --help option to get usage information.
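
As a minimal sketch of this rule in Python (the function name `main`, the wording of the first message, and the exit status 2 for usage errors are illustrative assumptions, not part of the original article):

```python
import sys

def main(argv, prog="biotool"):
    """Return a process exit status; print a usage hint if nothing was supplied."""
    if not argv:
        # The hint goes to stderr so stdout stays clean for real output.
        print(f"{prog}: no input file given", file=sys.stderr)
        print("Please use the --help option to get usage information.", file=sys.stderr)
        return 2  # non-zero status: the tool could not do its job
    # ... the real work would start here (placeholder) ...
    return 0
```

A script would typically end with `sys.exit(main(sys.argv[1:]))` so the status reaches the shell.
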
  • Always have a "-h" or "--help" switch

The Unix tradition is for all commands to have a "-h" or "--help" switch, which when invoked, prints usage information about the command. Most languages come with a getopt() type library, so there is no excuse for not supporting this.

% biotool -h
Usage: biotool [options] <file.fq>
Options:
--rc       reverse complement
--trim nn  trim <nn> bases from 3' end first
--mask     remove vector sequence contaminant
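
In Python, for instance, the standard `argparse` module adds "-h" and "--help" automatically. The sketch below mirrors the options from the usage text above; the description string and the function name `build_parser` are made up for illustration.

```python
import argparse

def build_parser():
    """Build a parser for the hypothetical biotool; -h/--help is added by argparse."""
    parser = argparse.ArgumentParser(
        prog="biotool",
        description="Hypothetical read-processing tool.",
    )
    parser.add_argument("fastq", metavar="file.fq", help="input FASTQ file")
    parser.add_argument("--rc", action="store_true", help="reverse complement")
    parser.add_argument("--trim", metavar="nn", type=int, default=0,
                        help="trim <nn> bases from 3' end first")
    parser.add_argument("--mask", action="store_true",
                        help="remove vector sequence contaminant")
    return parser
```

Calling `build_parser().parse_args()` with "-h" prints the generated usage text and exits.
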
  • Have a "-v" or "--version" switch

Many bioinformatics tools today are used as part of larger pipelines, or put into the Galaxy toolshed. Because compatibility depends on the version of your tool being used, you should have a simple, machine-parseable way to identify which version of the tool is installed.

% biotool --version
biotool 1.3a
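
One possible way to do this with Python's `argparse` is its built-in "version" action. The names `__version__` and `version_string` below are illustrative choices; the version number is the one from the example above.

```python
import argparse

__version__ = "1.3a"  # single place to bump on release

def version_string(prog="biotool"):
    """Return a one-line 'name version' string that is easy to parse."""
    return f"{prog} {__version__}"

def build_parser():
    parser = argparse.ArgumentParser(prog="biotool")
    # The "version" action prints the string and exits with status 0.
    parser.add_argument("-v", "--version", action="version",
                        version=version_string())
    return parser
```

Keeping the output to a single "name version" line means a pipeline can split it on whitespace without further parsing.
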
  • Use stderr for messages and errors

If you need to print an error message, or are just printing out progress or log information, try to use stderr rather than stdout. Try to reserve stdout for use as your output channel, so that it can be used in Unix pipes to avoid temporary files.

% biotool reads.fq | fq2fa > clean.fa
biotool: processing reads.fq
fq2fa: converted 423421 reads
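
A small Python sketch of the same separation follows. The function names `log` and `process` are hypothetical, and the "processing" is a pass-through placeholder.

```python
import sys

def log(message, prog="biotool", err=None):
    """Write a progress or error message to stderr, never to stdout."""
    print(f"{prog}: {message}", file=err if err is not None else sys.stderr)

def process(lines, out=None, err=None):
    """Write results to stdout (pipe-friendly) and report progress on stderr."""
    out = out if out is not None else sys.stdout
    count = 0
    for line in lines:
        out.write(line + "\n")  # placeholder for the real per-record work
        count += 1
    log(f"wrote {count} lines", err=err)
```

Because the progress line never reaches stdout, the output can be piped straight into the next tool.
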
  • Validate your parameters

If you have command line options, do some validation or sanity checking on them before letting them through to your critical code. Many getopt() libraries support basic validation, but ultimately it is not that difficult to have a preamble with some "if not XXX { print ERROR ; exit }" clauses.

% biotool --trim -3 reads.fq 
Error: --trim must be an integer > 0
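
With Python's `argparse`, this kind of check can be attached to the option itself through a `type` callback. This is a sketch only; `positive_int` is an illustrative helper, not a library function.

```python
import argparse

def positive_int(text):
    """argparse type callback: accept only integers greater than zero."""
    try:
        value = int(text)
    except ValueError:
        raise argparse.ArgumentTypeError(f"{text!r} is not an integer") from None
    if value <= 0:
        raise argparse.ArgumentTypeError("must be an integer > 0")
    return value

parser = argparse.ArgumentParser(prog="biotool")
parser.add_argument("--trim", metavar="nn", type=positive_int, default=None)
parser.add_argument("fastq", metavar="file.fq")
```

When the callback raises `ArgumentTypeError`, `parse_args` reports the message against the offending option and exits with a non-zero status before any real work starts.
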
  • Don't hard-code any paths

Often the tool you write depends on some other files, such as config files or database/model files. The easiest, but wrong and annoying, thing to do is just put the absolute path to them straight into your code:

% biotool --mask reads.fq 
Error: can't load /home/steven/work/biotool/data/vector.seq
# ARRRGGGGHHH!
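
One common alternative, sketched here in Python, is to locate data files relative to the installed script and allow an override. The environment variable name `BIOTOOL_DATA` and the `data/` folder layout are assumptions made for this example.

```python
import os
from pathlib import Path

def data_dir(script_path, environ=os.environ):
    """Return the data folder: an explicit override if set, else next to the script."""
    override = environ.get("BIOTOOL_DATA")  # hypothetical variable name
    if override:
        return Path(override)
    # Resolve symlinks so an install linked into a bin/ folder still works.
    return Path(script_path).resolve().parent / "data"
```

Inside the tool this would be called as `data_dir(__file__) / "vector.seq"`, so the path follows the installation wherever it is unpacked.
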
  • Don't pollute the command-line name space

You've come up with a new tool called "BioTool". The command you want everyone to invoke is called "biotool", but it is just a master script which runs lots of other tools. Unfortunately you used lots of generic names like "fasta2fastq", "convert", "filter" .. and so on, and you've put them all in the same folder as the main "biotool" script. So when I install BioTool, my PATH gets filled with rubbish. Please don't do this.

% ls -1 /opt/BioTool/
biotool 
convert      # whoops, clashes with ImageMagick!
load-hash.py # hello Titus :-)
filter
diff         # whoops, clashes with standard Unix tool!
test.sh      # <face-palm>

The first solution is to prefix all your sub-tools and helper scripts with "biotool". The second solution, if they are scripts only, is to not make them executable (so they don't go in PATH) and invoke them via the interpreter (perl, python, ...) explicitly from biotool. The third solution is to put them all in a separate folder (e.g. auxiliary/, scripts/ ...) and explicitly call them (but take note of "Don't hard-code any paths" above).
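
The second and third solutions can be combined. In the Python sketch below, helpers live in a `scripts/` subfolder (an assumed layout) and are run through the interpreter, so they need neither the executable bit nor a place in PATH. `helper_command` is a hypothetical name.

```python
import sys
from pathlib import Path

def helper_command(install_dir, helper, *args):
    """Build the argv for a non-executable helper script kept outside PATH."""
    helper_path = Path(install_dir) / "scripts" / helper
    # sys.executable is the interpreter running the main script.
    return [sys.executable, str(helper_path), *args]
```

The returned list can be passed to `subprocess.run(..., check=True)`, which also raises an exception if the helper exits with a non-zero status.
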

  • Don't distribute bare JAR files

If your tool is written in Java and is distributed as a JAR file, please write a simple shell wrapper script to make it simple to invoke. The three lines below are all you need (in the simple case) and you will make your users much happier.

#!/bin/bash
PREFIX=$(dirname "$0")
java -Xmx500m -jar "$PREFIX/BioTool.jar" "$@"
  • Check that your dependencies are installed

I've installed BioTool, and I start running it, and all looks good. Then 2 hours later it spits out an error like "error: can't run sff2CA". This could all be avoided if biotool checked for all the external tools it needs before it commenced, which would save your users from associating your software with pain.

% biotool --stitch R1.fq R2.fq
This is biotool 1.3a
Loaded config
Checking for 'bwa': found /usr/bin/bwa
Checking for 'samtools': ERROR - could not find 'samtools'
Exiting.
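
In Python, the standard library's `shutil.which` performs this PATH lookup. The sketch below reproduces the messages from the example above; the function name and the injectable `which` parameter (used here to make the check testable) are illustrative.

```python
import shutil
import sys

def check_dependencies(tools, which=shutil.which, err=None):
    """Look up every external tool before starting; return the missing ones."""
    err = err if err is not None else sys.stderr
    missing = []
    for tool in tools:
        path = which(tool)  # None when the tool is not on PATH
        if path is None:
            print(f"Checking for '{tool}': ERROR - could not find '{tool}'", file=err)
            missing.append(tool)
        else:
            print(f"Checking for '{tool}': found {path}", file=err)
    return missing
```

Calling this at start-up and exiting with a non-zero status when the list is non-empty moves the failure from hour two to second one.
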
  • Be strict if you are still a Perl tragic like me

If you're old like me and Perl is still your native tongue, at least play it a little bit safer by starting all your scripts with the following lines:

#!/usr/bin/env perl
use strict;
use warnings;
use autodie; # replaces the obsolete Fatal module

Update

Also keep this in mind when it comes to writing documentation:

Top considerations for creating bioinformatics software documentation


very nice article, thank you! Let's hope that many people will adopt these guidelines.


This is great, thanks!

11.2 years ago

Always raise an error if something goes wrong

The worst thing is when a command line tool fails for whatever reason, but doesn't return any error, so there is no way to know that the output is not complete.

A famous case in bioinformatics is tabix, which doesn't return any error when a download is interrupted by a network error. For example, try to download any file from 1000 Genomes using tabix, and manually interrupt the connection during the download (e.g. unplug your ethernet cable):

tabix  -h ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/phase1/analysis_results/consensus_call_sets/snps/All.chr22.LC1094_UNION_GL.20101123.snp.low_coverage.genotypes.vcf.gz

You will see that no error is given (see tabix bug report 59).
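
For a tool written in Python, one way to guarantee this is to funnel every failure into a non-zero exit status. This is a generic sketch, not how tabix works; `run` and the status value 1 are illustrative choices.

```python
import sys

def run(task, prog="biotool", err=None):
    """Run task(); return 0 on success, or 1 with a message on stderr on failure."""
    err = err if err is not None else sys.stderr
    try:
        task()
    except Exception as exc:
        print(f"{prog}: error: {exc}", file=err)
        return 1  # any non-zero status tells the caller the output is incomplete
    return 0
```

A script that ends with `sys.exit(run(work))` lets a pipeline stop on failure, for example under `set -e` in a shell script.
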

9.3 years ago

Great suggestions, thanks (I only found this thread now, through 'What are your "model" examples for bioinformatics programming?'). I would add:

Allow reading input from stdin and by default write output to stdout

(If reasonably possible of course, I wouldn't twist the underlying algorithms just to allow that)
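
A sketch of that default in Python, following the common Unix convention that a missing file name or "-" means the standard stream. The function names are made up and the processing step is a pass-through placeholder.

```python
import sys

def open_input(path=None):
    """Return stdin when no path (or '-') is given, else open the file for reading."""
    return sys.stdin if path in (None, "-") else open(path)

def open_output(path=None):
    """Return stdout when no path (or '-') is given, else open the file for writing."""
    return sys.stdout if path in (None, "-") else open(path, "w")

def copy_records(source, sink):
    """Placeholder for the real per-record processing."""
    for line in source:
        sink.write(line)
```

With this in place the same code path serves `biotool reads.fq` and `zcat reads.fq.gz | biotool`.
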

  1. Make each program do one thing well. To do a new job, build afresh rather than complicate old programs by adding new "features".

  2. Expect the output of every program to become the input to another, as yet unknown, program.

McIlroy M.D., Pinson E.N., and Tague B.A. (1978). Unix Time-Sharing System Foreword. The Bell System Technical Journal. 57(6)

7.4 years ago

One minor point: Error messages should identify which program is generating the message. For example, both the shell and the program might generate a message, so it should be clear which one is responsible. To illustrate from an example above:

% biotool --mask reads.fq 
Error: can't load /home/steven/work/biotool/data/vector.seq

would be more informative as

% biotool --mask reads.fq 
biotool: can't load /home/steven/work/biotool/data/vector.seq

In fact one of the other examples above did just that. This practice is especially important in pipelines when scripts call scripts which call binaries, etc.
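
In Python, the program name can be derived from `sys.argv[0]` rather than typed into every message. A minimal sketch, where `format_error` and the "biotool" fallback are illustrative:

```python
import os
import sys

def format_error(message, argv0=None):
    """Prefix a message with the name the program was invoked as."""
    argv0 = sys.argv[0] if argv0 is None else argv0
    prog = os.path.basename(argv0) or "biotool"  # fallback if argv0 is empty
    return f"{prog}: {message}"
```

Printing `format_error(...)` to stderr keeps the prefix correct even if the script is renamed or installed under a different name.
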
