Question

Tutorial:Fastq Quality Control And Reporting - Aka Fastqc Versus The New Contenders

21

Entering edit mode

11.7 years ago

Istvan Albert 101k

The venerable FastQC is many scientists' first choice to generate that first look of a sequencing data. The software created in 2010 runs as a simple command line or GUI tool and generates various statistics plots. It is simple and fast, the plots look good although many plots show quantities that are easy to misunderstand.

For example read qualities are grouped for longer reads, values are mysteriously normalized to 100% leading to wrong conclusions by those that don't notice the finer details. Moreover the software is not suited for paired end read analysis and the reporting mode is unwieldy when running on dozens of samples.

So naturally I was intrigued when noticing two new approaches published recently:

HTQC: a fast quality control toolkit for Illumina sequencing data published in BMC Bioinformatics, 2013
NGSUtils: a software suite for analyzing and manipulating next-generation sequencing datasets published in Bioinformatics, 2012

This post is an evaluation of how they work based on first hand experiences performed this morning, starting from a basic task of having to evaluate a dataset with 24 million reads. Let me note that these happen to be really short reads by today's standards at only 40 bases long.

My benchmark is what I would normally do:

$ time fastqc data/28824.fq 
... removed ...    
It seems our guess for the total number of records wasn't very good.  Sorry about that.
... removed ...
real    2m25.152s
user   2m27.437s
sys   0m2.023s

Let me get something off my chest here. I have seen that apology above so many times - I can't recall the last time I did not see it. The only effect it has on me is to make me wonder: just why exactly is it so difficult in guess the number of lines? After all each line has the exact same length and is composed of ASCII letters.

Oh well, why am I even mentioning that ... fastqc runs well in 2 minutes 25 seconds, generates quite a few plots and the maximum amount of memory used was about 180MB

Now onto bigger and better tools that will blow this old whale out of the water.

Installation

HTQC: turns out it needs CMake to install. Well CMake is supposed to make (pun intended) life easier, alas in my experience in only means more trouble. Sure enough the version of my current CMake is not good enough, it needs a higher version. Now I need to download and install that. Manually of course since the package manager for the so called Scientific Linux does not have that. Thanks a bunch. Ok done. After that it compiles fine.

NGSUtils: I do a git clone and type make, it goes to town furiously downloading and compiling a lot of resources that already exist on my computer. But it does finish its job although leaves me with an uncertain feeling as to whether or not it has modified anything that is already there.

Runtimes

The paper for HTQC claims it to be three times FASTER than FastQC and claims to use a lot less memory as well.

Let's run it. Turns out you really need the -q option otherwise there is a message printed every 5000 lines. I must say default behavior like is not all that reassuring.

time ~/src/htqc-0.11.1/build/ht_stat -q --out report data/28824.fq
real    5m58.007s
user    5m48.289s
sys    3m16.106s

The observed runtime is more than two times SLOWER !!!, and while running the program used 1.9GB !!! of memory.

Alas there is more, this does not actually generate plots only datasets. To get the plots one needs to run a separate program that invokes gnuplot. Great I have that already installed. Running the tool fails with a mysterious error "font not a valid variable". Internet sleuthing indicates that this error occurs when making use of features that are only available in the latest GnuPlot version 2.6. My package manager does not have this version (of course) of so it needs to be installed manually. Oh well, I did that too but my patience is running thin. Run already:

time ~/src/htqc-0.11.1/ht_stat_draw.pl --dir report
???

The process does not seem to finish! Some plots are generated but the command does not return. The plots that the command generates are very ugly and look wrong, the bases extend to the 100 range even though the reads are only 40 bases long. So far it does not look good at all.

Let's try to other contender: NGSUtils has a command called fastqutils stats we invoke it like so:

~/src/ngsutils/bin/fastqutils stats data/28824.fq

Facepalm moment ensues, this script prints information to the standard output for every single read that it investigates. The interface is made to look slick. There is a little rotating pipe that show that the program runs, the name of each read is printed and it continuously computes information such as the ETA and percent done. It is also insane slow because the speed at which this tool runs is equal to the speed of writing characters to the screen. No wonder the ETA indicates 24 minutes.

So there you have: it two recently published tools each claiming to do something better whereas in practice they are immensely inferior to much older tools and techniques. Perhaps Fred was onto something A farewell to bioinformatics

quality-control fastqc • 14k views

ADD COMMENT • link updated 22 months ago by Ram 44k • written 11.7 years ago by Istvan Albert 101k

6

Entering edit mode

This is also informative about the absurdly low bar set for tool papers. This makes me question the viability of tool papers altogether.

ADD REPLY • link 11.7 years ago by Jeremy Leipzig 22k

1

Entering edit mode

My first tool-related manuscript, which was rejected, was even worse than HTQC and ngsutils. I had invested quite a lot of time and thought I was doing for good, but when I looked back, my that tool is just something that I criticize every day. You are right that we should be cautious of tool papers, but most of these authors are trying to write something useful and more importantly they need a publication to graduate.

ADD REPLY • link 11.7 years ago by lh3 33k

2

Entering edit mode

One could argue that the responsibility is on the reviewers - but then they are not directly rewarded for a job well done and the incentives to do be thorough are lacking. So in the end it is the system that is to blame - too many people need to write too many papers to move ahead on the pre-determined social ladder.

ADD REPLY • link 11.7 years ago by Istvan Albert 101k

0

Entering edit mode

agreed, and sometimes a tool I've written makes complete sense to me and runs flawlessly, but others seem to find a way to use it in a way that I haven't foreseen (or cared about) so to them it is frustrating and useless. (try installing scipy if you're not a seasoned linux user)

ADD REPLY • link 11.7 years ago by brentp 24k

0

Entering edit mode

I do agree. How many times (I am not going to cite the tools) in my team we have been lost into tool "bugs". We have spent a incredible amount of time running a tool to see it crash in the end, and for some tools it could fail after one week. You also sometimes report this to the developer and you suddenly learn that there are quite a few things to improve to make it work really properly. Well, thank you very much. But sure, the paper claimed that it was going to be the inevitable tool of the year.

ADD REPLY • link 11.7 years ago by toni ★ 2.2k

2

Entering edit mode

I do not like CMake, either. The right building system should not require end users to install anything but the basic unix tools. Also, developers should provide Linux binary for tools that depends on non-standard libraries or is complex to build.

ADD REPLY • link 11.7 years ago by lh3 33k

0

Entering edit mode

Thanks for the checking them out.

Another way to look at raw data tileQC. It produces tiles and one can quickly look for the spots which are of bad quality. This is just useful to have a glimpse of data as a whole (good/bad). Its available as R and Python version (python is supposedly faster)

ADD REPLY • link updated 22 months ago by Ram 44k • written 11.7 years ago by Sukhi Singh 11k

0

Entering edit mode

Have you run this yourself? Seems to create a mysql database for each dataset - I don't see that as feasible for today high throughput

ADD REPLY • link updated 22 months ago by Ram 44k • written 11.7 years ago by Istvan Albert 101k

0

Entering edit mode

Yes, but I have ran the R version, Didn't create that there.

ADD REPLY • link updated 22 months ago by Ram 44k • written 11.7 years ago by Sukhi Singh 11k

0

Entering edit mode

Thanks for the post - very informative and good to see that sticking with FastQC out of convenience is also the smarter choice :-)

ADD REPLY • link 11.7 years ago by Sebastian Kurscheid ▴ 300

Ram · Answer 1 · 2013-02-26

Regarding I/O performance in FASTQ (and other) tools

My name was (wrongly) mentioned, so I feel fine to chime in.

First - many thanks to Aaron for his kind words, but FastQC was written by Simon Andrews from The Babraham institute. I developed other tools called "fastx-toolkit" long ago, but those were not mentioned in the above post (they are pretty fast, though :) ).

Now, when it comes to processing FASTQ files:

They are usually huge - meaning lots of I/O
The are sometimes compressed, which complicates matters.
Their format is (almost) fixed simple, except when it's not.

On top of that, there's the usual problem that when you write code, it's better if you actually know how to program, and understand the intricate details of what's going on.

Sadly, many "bioinformaticians" are more "informaticians" and less programmers.

With the two programs that you've mentioned:

"htqc" is written in CPP, trying pretty OO design, but falls short when it does I/O.

The main I/O loop in htqc-0.12.0-source/htio/PlainFileHandle.cpp is this:

while ((curr_char = fgetc(handle)) != EOF)
{
     if (curr_char == LINE_SEPARATOR) break;
     buffer.push_back(curr_char);
}

Meaning it reads the (huge) FASTQ file character-by-character, and adds it to std::vector.

Sorry - this will never perform well.

"NGSUtils" is written in Python (yuck) which already means performance will not be great.

I'm a Perl person, but I don't claim Perl will do better. With raw I/O and simple logic (as in most simlpe Fastq stats), you'll want to do it in a more low-level programming language.

But regardless of Python, the main I/O loop again falls short.

In ./ngsutils-ngsutils-0.5.0c/ngsutils/fastq/__init__.py, the loop is:

while True:
    try:
        name = self.fileobj.next().strip()[1:]

        spl = re.split(r'[ \t]', name, maxsplit=1)
        name = spl[0]
        if len(spl) > 1:
            comment = spl[1]
        else:
            comment = ''

        seq = self.fileobj.next().strip()
        self.fileobj.next()
        qual = self.fileobj.next().strip()

        If eta:
            eta.print_status(name)
        yield FASTQRead(name, comment, seq, qual)

        except:
            break

Which means that on top of Python's (slow) I/O, it does string.strip() on three lines and a regular-expression string.split() every four lines.

This again - will not perform well, ever.

Also checking "eta" and printing the progress report on EVERY read is wasteful.

As a programmer, I understand the temptation to use high-level constructs (e.g. regexp split) to do something useful in very few lines, but one has to understand that it comes with a price.

No doubt using something like C's strpos() to find the space in the ID string is more cumbersome, but it's WAY faster.

Sadly, Peer-reviewed bioinformatic journals don't do proper code-review - so sub-par code is regularly published.

Regarding your (justified) rant:

"just why exactly is it so difficult in guess the number of lines? After all each line has the exact same length and is composed of ASCII letters" ?

Not all lines have the same length:

The ID line is subtly changed, because they lane/tile/X/Y values are reported as numbers.

e.g.:
```
@HWI-M1:36:000000000-A2R1H:1:1101:14988:1368 1:N:0:NTCAC
@HWI-M1:36:000000000-A2R1H:1:109:99:7 1:N:0:NTCAC
@HWI-M1:36:000000000-A2R1H:1:432:9:1111 1:N:0:NTCAC
```
So while you can approximate the size based on the average size of the first 4 lines (or the first 1000 lines) and divide by the file size, it won't be accurate.
Most FASTQ files are stored compressed.

Many tools can either read them and automatically uncompress them, or used with pipes - so they don't know the FASTQ file size in advance.
Even now and them you encounter a corrupt file - trucated during upload/download, or bad disk, or whatever - in those cases it's better to have a tool that scans the entire file and verifies that everything is OK (I got many "bug report" emails that turned out to be bad input files).

So while a tool that does guestimate the number of reads is possible - I guess nobody bothered to write one.

As a shameless plug, I wrote a tool that exactly that (just count reads in FASTQ files), and it's fast enough for me (not necessarily faster than other tools): https://github.com/agordon/fastx_read_count

Regards,
gordon