Dan D · 7.4k · 9.0 years ago
Has anyone used the tool described in this paper for compressing FASTQ data? I'm going to evaluate it when I get time, but I wanted to see if anyone has intel on it ahead of that. I'll report back here after giving it my assessment.
This paper has been temporarily withdrawn by the authors.
Just for reference, the journal website says: "This manuscript has been temporarily withdrawn at the request of the authors. The authors report that they have identified an error in the software. This withdrawal is to provide the authors with an opportunity to determine to what extent the reported results are affected by this error."
It would have been nice of the journal to update the HTML version of the article with the same notice.
Frankly, this sounds like a more serious problem than just a bug in the method. It could be a methodological error in the evaluation: for example, the output files were actually larger and slower to produce than the competition, and the comparison got switched up. That can happen easily.
See the edit below. Beyond that, the fact that in 2015 the Bioinformatics journal publishes software that is "available" only from someone's personal webpage is saddening.
--- Edit ---
Actually, scratch that (kind of): there is a GitHub repo here:
https://github.com/mariusmni/lfqc
Still, at the time of publication there was no repository.
Very useful info, thanks!
If you haven't seen it (I hadn't until a few days ago, coincidentally), there was a compression challenge recently that evaluated a few related tools: http://www.pistoiaalliance.org/projects/sequence-squeeze/
An article describing the results is here, in case you wanted to investigate alternatives to lfqc.
Anyone familiar enough with Ruby out there who can comment on the code? It looks like it might be a wrapper around Mahoney's lpaq and zpaq tools, but I'm not certain. I can't get behind the paywall and I haven't found a preprint, so maybe it is described in the paper?
I am not familiar with Ruby, but the code is easy to understand. I may be wrong, but I think the core of the algorithm is in the calls on line 15 and line 99 of the script.
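For anyone who has not opened the repo, a wrapper of this kind typically just shells out to the external compressors. A purely hypothetical sketch of what such calls tend to look like (file names, levels and flags are my assumptions, not copied from the actual script):

    # Hypothetical sketch only -- not copied from the lfqc script.
    seq_file  = "reads.seqs"
    qual_file = "reads.quals"

    # sequence stream -> lpaq8 (usage: lpaq8 N input output, N = memory level)
    system("./lpaq8 9 #{seq_file} #{seq_file}.lq8")

    # quality stream -> zpaq ("a" adds files to an archive)
    system("zpaq a #{qual_file}.zpaq #{qual_file}")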
Edit: the GitHub code has an Apache license, but it is surprising that the manuscript states "the implementations are freely available for non-commercial purposes", which seems to me incompatible with both the zpaq and lpaq licenses.
The tar step is just packaging, not compression. It looks like a wrapper around Matt Mahoney's compression tools, applied to the different pieces of a FASTQ record.
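To make the "different pieces" point concrete, here is a minimal sketch of splitting each FASTQ record into separate header, sequence and quality streams before handing them to the external compressors (my own illustration, not the lfqc code; file names are made up):

    # Sketch (not the actual lfqc code): split a FASTQ file into three
    # streams so each can go to a separate backend compressor.
    headers = File.open("reads.headers", "w")
    seqs    = File.open("reads.seqs", "w")
    quals   = File.open("reads.quals", "w")

    File.foreach("reads.fastq").each_slice(4) do |head, seq, _plus, qual|
      headers.puts head.chomp
      seqs.puts    seq.chomp
      quals.puts   qual.chomp
    end

    [headers, seqs, quals].each(&:close)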
Yes, the tar only bundles together files, but it is important because it keeps things tidy. :-)
I glanced over the paper: for the sequence and quality streams it is practically just a wrapper for lpaq8 and zpaq, respectively. There is some preprocessing (encoding runs of '#' as a bit flag and removing newlines), but nothing original. The header line is "tokenized" (split into pieces), and the tokens are compressed with run-length encoding or incremental encoding, or just reversed, as they "observed that this tends to improve the compression ratio of the context mixing algorithm applied downstream". Then it is again compressed with zpaq.
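To illustrate what incremental (delta) encoding does to a numeric header token, a small stand-alone example (my own, not the paper's code; the field values are made up):

    # Delta-encode a numeric header field, e.g. x/y coordinates in Illumina
    # read names. Differences from the previous value are small and repetitive,
    # which suits the context-mixing compressor applied downstream.
    def delta_encode(values)
      prev = 0
      values.map { |v| d = v - prev; prev = v; d }
    end

    def delta_decode(deltas)
      prev = 0
      deltas.map { |d| prev += d }
    end

    xs = [10123, 10127, 10130, 10142]
    p delta_encode(xs)               # => [10123, 4, 3, 12]
    p delta_decode(delta_encode(xs)) # => [10123, 10127, 10130, 10142]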
Actually, that was my impression too. I did not spend much time on the details, but honestly it just seemed like running two existing methods, even invoking them as command-line applications. My first thought was: how is this a bioinformatics paper?