Hi All,
I am starting this discussion to get a general viewpoint on the annoying struggle of compressing and decompressing FASTQ files (since almost all NGS analysis starts with these files). While it is understood that compression is important to save space, there are a couple of routine problems where a considerable amount of time is wasted either compressing or decompressing FASTQ files.
Now, for basic analysis like trimming, cleaning, or computing FASTQ stats, tools can be classified into the categories below:
- tools which only work on compressed FASTQ files (.gz)
- tools which only work on decompressed FASTQ files
- tools which work on decompressed FASTQ files and decompress compressed input themselves before analysing
- tools which work on both compressed and decompressed files (e.g. Trimmomatic, FastQC; see the sketch below)
Isn't there a need for a common protocol/guideline so that tools are designed to work on compressed FASTQ files?
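For what it's worth, the last category above is usually implemented by sniffing the two-byte gzip magic number (1f 8b) and switching readers. A minimal bash sketch of that check, with the file name assumed:
infile=reads.fastq.gz
# gzip streams start with the magic bytes 1f 8b
if [ "$(od -An -tx1 -N2 "$infile" | tr -d ' \n')" = "1f8b" ]; then
    reader=zcat
else
    reader=cat
fi
"$reader" "$infile" | head -4    # first FASTQ record, compressed or not
Wrapping a plain-text-only tool this way is often easier than waiting for upstream support.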
33% of bioinformatics is just dealing with tool quirks. We would all like standards, but then you know what happens:
All tools that work on compressed files will decompress them (in memory) before analysing them, and the performance of that decompression may itself vary. One complication here is that one has to "pay" the cost of decompression each time a tool is run on compressed data. This may not be a problem, though, since the process is likely to be IO bound rather than CPU bound; when running tools in a highly parallel fashion, this may change.
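Whether a given run is IO bound or CPU bound is easy to check; a rough sketch (file name assumed), comparing the cost of decompression alone against decompression plus a trivial consumer:
time zcat reads.fastq.gz > /dev/null    # decompression cost alone
time zcat reads.fastq.gz | wc -l        # decompression plus a trivial consumer
If the two timings are close, the downstream step is not the bottleneck; swapping zcat for pigz -dc may also shave off part of the repeated decompression cost.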
You should be working with pipes (if the tool accepts them) or with "bash process substitution", like this:
<(zcat myfasta.gz)
The construct works as if the unzipped content were coming from a normal file.
This is only occasionally supported by pre-processing tools, so it is not that helpful.
What do you mean by occasional support?
You can also add uBAM files to the list above, which should be more convenient than FASTQ files.
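To make the process substitution suggestion above concrete, here is a sketch for a tool that insists on a plain FASTQ path (the tool name is hypothetical):
# hypothetical tool that only accepts an uncompressed FASTQ path
plain_fastq_tool --in <(zcat reads_R1.fastq.gz) --out trimmed_R1.fastq
# same idea with an ordinary pipe, if the tool can read stdin
zcat reads_R1.fastq.gz | plain_fastq_tool --in /dev/stdin --out trimmed_R1.fastq
In both cases the gzipped file is decompressed on the fly and never written to disk uncompressed.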
Some example tools for each problem would give an idea of how frequently people use them. For some basic operations,
unzipping | operation | zipping
is the usual case. But having tools that work directly on compressed files is a good idea.
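As a concrete sketch of streaming that same pattern without intermediate files, assuming seqtk is installed (it accepts - for stdin):
zcat in.fastq.gz | seqtk trimfq - | gzip > out.trimmed.fastq.gz
The reads never touch the disk uncompressed, so the unzip and rezip steps cost no extra disk space or separate passes.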