The venerable FastQC is many scientists' first choice for getting that first look at sequencing data. The software, created in 2010, runs as a simple command line or GUI tool and generates various statistics plots. It is simple and fast, and the plots look good, although many of them show quantities that are easy to misunderstand.
For example, read qualities are grouped into position bins for longer reads, and some values are mysteriously normalized to 100%, leading to wrong conclusions by those who don't notice the finer details. Moreover, the software is not suited for paired-end read analysis, and the reporting mode is unwieldy when running on dozens of samples.
So naturally I was intrigued when I noticed two new approaches published recently:
- HTQC: a fast quality control toolkit for Illumina sequencing data published in BMC Bioinformatics, 2013
- NGSUtils: a software suite for analyzing and manipulating next-generation sequencing datasets published in Bioinformatics, 2012
This post is an evaluation of how they work, based on first-hand experience from this morning, starting from the basic task of having to evaluate a dataset with 24 million reads. Let me note that these happen to be really short reads by today's standards, at only 40 bases long.
My benchmark is what I would normally do:
$ time fastqc data/28824.fq
... removed ...
It seems our guess for the total number of records wasn't very good. Sorry about that.
... removed ...
real 2m25.152s
user 2m27.437s
sys 0m2.023s
Let me get something off my chest here. I have seen that apology above so many times - I can't recall the last time I did not see it. The only effect it has on me is to make me wonder: just why exactly is it so difficult to guess the number of lines? After all, each line has the exact same length and is composed of ASCII letters.
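For what it's worth, a rough guess is not hard to sketch. I have no idea how FastQC computes its estimate, so the following is only an illustration under my own assumptions: an uncompressed FASTQ file with exactly four lines per record. The function name `estimate_fastq_records` and the `sample_records` parameter are made up for this example. It reads the leading records, measures how many bytes they occupy, and scales that up to the size of the file.

```python
import os


def estimate_fastq_records(path, sample_records=10000):
    """Estimate the record count of an uncompressed, four-line-per-record FASTQ file.

    Hypothetical helper for illustration; this is not how FastQC does it.
    """
    total_bytes = os.path.getsize(path)
    sampled_bytes = 0
    sampled = 0
    with open(path, "rb") as handle:
        while sampled < sample_records:
            # Assumption: header, sequence, '+', qualities, with no wrapped lines.
            lines = [handle.readline() for _ in range(4)]
            if not lines[3]:
                break  # end of file, or a truncated final record
            sampled_bytes += sum(len(line) for line in lines)
            sampled += 1
    if sampled == 0:
        return 0
    # Scale the sampled records up by the ratio of file size to sampled bytes.
    return round(total_bytes * sampled / sampled_bytes)
```

The guess is only as good as the assumption that the leading records are representative of the rest. Header lines in particular can vary in length across a file, and that alone would throw an estimate like this off a little.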
Oh well, why am I even mentioning that ... FastQC runs well in 2 minutes 25 seconds, generates quite a few plots, and the maximum amount of memory used was about 180MB.
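For anyone wanting to collect this kind of number themselves, here is a minimal sketch of one way to get wall-clock time and peak memory for a command in one go. It is an illustration only and not necessarily how the figures in this post were obtained. The helper name `run_and_measure` is hypothetical, and it assumes a Unix system, since Python's `resource` module is not available elsewhere.

```python
import resource
import subprocess
import sys
import time


def run_and_measure(command):
    """Run a command and return (exit code, seconds elapsed, peak memory in MB).

    Hypothetical helper; Unix only because it relies on the resource module.
    """
    start = time.perf_counter()
    completed = subprocess.run(command, check=False)
    elapsed = time.perf_counter() - start
    usage = resource.getrusage(resource.RUSAGE_CHILDREN)
    # ru_maxrss is reported in kilobytes on Linux but in bytes on macOS.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return completed.returncode, elapsed, usage.ru_maxrss / divisor


if __name__ == "__main__":
    # Example: python measure.py fastqc data/28824.fq
    code, seconds, peak_mb = run_and_measure(sys.argv[1:])
    print(f"exit={code} time={seconds:.1f}s peak={peak_mb:.0f}MB")
```

One caveat: `RUSAGE_CHILDREN` reports the largest peak among all child processes that have finished so far, so the memory number is only meaningful when a single command is measured per Python process.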
Now onto bigger and better tools that will blow this old whale out of the water.
Installation
HTQC: turns out it needs CMake to install. Well, CMake is supposed to make (pun intended) life easier; alas, in my experience it only means more trouble. Sure enough, my current version of CMake is not good enough; it needs a higher version. Now I need to download and install that. Manually of course, since the package manager for the so-called Scientific Linux does not have it. Thanks a bunch. Ok, done. After that it compiles fine.
NGSUtils: I do a git clone and type make, and it goes to town furiously downloading and compiling a lot of resources that already exist on my computer. It does finish its job, although it leaves me with an uncertain feeling as to whether or not it has modified anything that is already there.
Runtimes
The paper for HTQC claims it to be three times FASTER than FastQC and claims to use a lot less memory as well.
Let's run it. Turns out you really need the -q option, otherwise a message is printed every 5000 lines. I must say default behavior like that is not all that reassuring.
time ~/src/htqc-0.11.1/build/ht_stat -q --out report data/28824.fq
real 5m58.007s
user 5m48.289s
sys 3m16.106s
The observed runtime is more than two times SLOWER!!! And while running, the program used 1.9GB (!!!) of memory.
Alas, there is more: this does not actually generate plots, only datasets. To get the plots one needs to run a separate program that invokes gnuplot. Great, I have that installed already. Running the tool fails with a mysterious error: "font not a valid variable". Internet sleuthing indicates that this error occurs when making use of features that are only available in the latest gnuplot version, 4.6. My package manager does not have this version (of course), so it needs to be installed manually. Oh well, I did that too, but my patience is running thin. Run already:
time ~/src/htqc-0.11.1/ht_stat_draw.pl --dir report
???
The process does not seem to finish! Some plots are generated but the command never returns. The plots it does generate are very ugly and look wrong: the base positions extend to the 100 range even though the reads are only 40 bases long. So far it does not look good at all.
Let's try the other contender: NGSUtils has a command called fastqutils stats, which we invoke like so:
~/src/ngsutils/bin/fastqutils stats data/28824.fq
Facepalm moment ensues: this script prints information to the standard output for every single read that it investigates. The interface is made to look slick. There is a little rotating pipe that shows that the program is running, the name of each read is printed, and it continuously computes information such as the ETA and percent done. It is also insanely slow, because the speed at which this tool runs is equal to the speed of writing characters to the screen. No wonder the ETA indicates 24 minutes.
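The usual cure for this is to throttle the progress display so that it refreshes at most once per time interval rather than once per read. The sketch below illustrates the idea. It is not NGSUtils code, and `iter_with_progress` along with its parameters is a hypothetical helper invented for this example.

```python
import sys
import time


def iter_with_progress(items, total, min_interval=1.0, stream=sys.stderr):
    """Yield items unchanged, writing a progress line at most once per min_interval seconds.

    Hypothetical helper for illustration only.
    """
    last_update = time.monotonic()
    for done, item in enumerate(items, start=1):
        yield item
        now = time.monotonic()
        if total and now - last_update >= min_interval:
            # A carriage return rewrites the same terminal line instead of scrolling.
            stream.write(f"\r{done}/{total} ({100 * done / total:.1f}%)")
            stream.flush()
            last_update = now
```

With a one second interval, a run over 24 million reads writes a few hundred progress updates at most instead of 24 million, so the terminal stops being the bottleneck. Writing to standard error also keeps the progress noise out of any results sent to standard output.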
So there you have it: two recently published tools, each claiming to do something better, whereas in practice they are immensely inferior to much older tools and techniques. Perhaps Fred was onto something in "A farewell to bioinformatics".
This is also informative about the absurdly low bar set for tool papers. This makes me question the viability of tool papers altogether.
My first tool-related manuscript, which was rejected, was even worse than HTQC and ngsutils. I had invested quite a lot of time and thought I was doing good, but when I look back, that tool is just the kind of thing that I criticize every day. You are right that we should be cautious of tool papers, but most of these authors are trying to write something useful and, more importantly, they need a publication to graduate.
One could argue that the responsibility is on the reviewers - but then they are not directly rewarded for a job well done, and the incentives to be thorough are lacking. So in the end it is the system that is to blame - too many people need to write too many papers to move ahead on the pre-determined social ladder.
Agreed, and sometimes a tool I've written makes complete sense to me and runs flawlessly, but others seem to find a way to use it that I hadn't foreseen (or cared about), so to them it is frustrating and useless. (Try installing scipy if you're not a seasoned Linux user.)
I do agree. How many times (I am not going to cite the tools) has my team gotten lost in tool "bugs"? We have spent an incredible amount of time running a tool only to see it crash in the end, and for some tools the failure could come after one week. You also sometimes report this to the developer and suddenly learn that there are quite a few things to improve before it really works properly. Well, thank you very much. But sure, the paper claimed that it was going to be the inevitable tool of the year.
I do not like CMake, either. The right build system should not require end users to install anything but the basic Unix tools. Also, developers should provide Linux binaries for tools that depend on non-standard libraries or are complex to build.
Thanks for checking them out.
Another way to look at raw data is tileQC. It produces tiles, and one can quickly look for the spots which are of bad quality. This is just useful to get a glimpse of the data as a whole (good/bad). It's available in R and Python versions (the Python one is supposedly faster).
Have you run this yourself? It seems to create a MySQL database for each dataset - I don't see that as feasible for today's high throughput.
Yes, but I ran the R version; it didn't create a database there.
Thanks for the post - very informative and good to see that sticking with FastQC out of convenience is also the smarter choice :-)