ALAPY Compressor (update: version 1.3 adds MacOS X support and 10-20% better speed and compression ratio)
High-throughput lossless genetics data compression tool.
Compresses a .fastq or .fastq.gz FILE to the .ac format, or decompresses an .ac FILE to a fastq.gz file containing an exact copy of the original fastq contents. By default the FILE is compressed or decompressed in place, the original file is left intact, and the lossless ALAPY Compressor algorithm is used. You can specify an output directory (it must already exist, as the program will not create it) and change the output file name. By default the program prints progress information to stdout, but this can be suppressed.
HOW TO GET THE TOOL
To get the latest version, visit http://alapy.com/services/alapy-compressor/ , scroll down to the Download section, select your system by clicking “for Windows” or “for Unix” (a version for Mac OS is coming), read the EULA at http://alapy.com/alapy-compressor-eula/ , and tick the checkbox. The DOWNLOAD button will then appear; click it to download the tool.
All versions of the ALAPY Compressor are also available for free on GitHub at https://github.com/ALAPY/alapy_arc ; the EULA is the same, and the tool is free software.
There are paid versions of the compressor with extended functionality and services. Please feel free to ask about them.
VERSIONS
Version 1.3.0:
- Added MacOS X support (10.12 Sierra and above)
- Optimized compression on the "medium" and "fast" levels (the .ac file is now ~10-15% smaller than with version 1.2.0)
- Added an experimental option for compression dictionary optimization (-a/--optimize_alphabet), which improves compression speed by up to 20% (see the usage example after this version list)
- Improved error handling
Version 1.2:
- Added a compression level option (-l/--level) with three levels (see the usage example after this version list):
- best - best compression ratio; the .ac file is 1.5-4 times smaller than with gzip, but compression is 4-6 times slower than gzip; requires 1500MB of memory
- medium - medium compression (the .ac file is 3-5% bigger than on the best level); 1.2-1.8 times slower than gzip; requires 100MB of memory; the default
- fast - fastest compression, 0.5-1.4 times the speed of gzip; the .ac file is 4-10% bigger than on the best level; requires 100MB of memory
Version 1.1:
- Added the ability to output decompression results to stdout (see help for the -d/--decompress option)
- Added the ability to compress data from stdin (see help for the -c/--compress option)
- Changed the input validation module for stdin/stdout support
- Improved thread synchronization (compression speed increased by 15% on average)
- Changed the data decompression module (decompression speed increased by 30% on average)
- Optimized intermediate data decompression to write directly to the output file or stdout
- Fixed end-of-line character handling
- Fixed comparison of reads’ headers and comments
Version 0.0:
- Initial public beta version
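As an illustration of the level and alphabet options above, typical invocations might look like this (a sketch assuming a v1.3 binary named alapy_arc on your PATH; file names are placeholders):
alapy_arc -c your_file.fastq -l fast        # fastest compression, .ac ~4-10% bigger than on best
alapy_arc -c your_file.fastq -l best -a     # best ratio plus experimental dictionary optimization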
INSTALLATION
The tool comes precompiled for Unix and Windows. Make it executable and put it in your PATH, or run it as is from its directory.
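For example, on a Unix-like system (a sketch; the binary name and install location are assumptions):
chmod +x alapy_arc                     # make the downloaded binary executable
sudo mv alapy_arc /usr/local/bin/      # or any other directory on your PATH
alapy_arc --version                    # verify the installation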
USAGE
alapy_arc [OPTION] [FILE] [OPTION]...
OPTIONS
The options for the program are as follows:
-h --help
Print this help and exit
-v --version
Print the version of the program and exit
-c --compress
Compress your fastq or fastq.gz file to an .ac file (ALAPY Compression format).
-d --decompress
Decompress your .ac file to a fastq or fastq.gz file.
-o --outdir
Create all output files in the specified output directory. Note that this directory must exist, as the program will not create it. If this option is not set, the output file is created in the same directory as the processed file. If the output file already exists there, a version suffix is added to the new file's name.
-n --name
Set the name of the output file (the extension is added automatically).
-q --quiet
Suppress all progress messages on stdout and only report errors.
EXAMPLES
alapy_arc --compress your_file.fastq.gz --outdir ~/alapy-archive --name renamed_compressed_file --quiet
This will compress your_file.fastq.gz to renamed_compressed_file.ac in the alapy-archive directory in your home folder, provided the alapy-archive directory exists. If renamed_compressed_file.ac is already present there, a file with a version suffix added to its name will be written to the alapy-archive directory.
alapy_arc -d your_file.ac
This will decompress your_file.ac (ALAPY Compressor format) into your_file.fastq.gz in the same folder. If a file named your_file.fastq.gz already exists, a version suffix will be added.
alapy_arc_1.1 -d your_file.fastq.ac - | fastqc /dev/stdin
bwa mem reference.fasta <(alapy_arc_1.1 -d your_file_R1.fastq.ac - ) <(alapy_arc_1.1 -d your_file_R2.fastq.ac - ) > your_bwa_mem.SAM
These are examples of piping in general. Note that they are not POSIX: process substitution <(...) is implemented in bash, not in sh. Some programs support reading from stdin natively; read their help and/or manuals. For example, FastQC supports it this way:
alapy_arc_1.1 -d your_file.fastq.ac - | fastqc stdin
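Named pipes are a POSIX-portable alternative to process substitution. A minimal sketch with bwa (file names are placeholders):
mkfifo reads.fifo
alapy_arc_1.1 -d your_file.fastq.ac - > reads.fifo &
bwa mem reference.fasta reads.fifo > your_bwa_mem.sam
rm reads.fifo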
You may find more about ALAPY Compressor usage on our website at http://alapy.com/faq/ (select ALAPY Compressor as the relevant topic).
PIPE-ability stdin/stdout testing
Now with stdin/stdout support you can use fastq.ac in your pipes, so there is no need to generate fastq or fastq.gz files on your hard drive. You can start with fastq.ac, run FastQC, then Trimmomatic, Trim Galore or Cutadapt, double-check with FastQC, and use BWA or Bowtie2, all in a pipe. This is what we have tested rigorously. Some tools support stdin or - as a parameter; named pipes, process substitution and /dev/stdin are the other ways to use fastq.ac in your pipes. Here is the testing summary, where a + sign shows support as tested and a - sign shows no high-quality support:
Columns: tool/version, subcommand, "command line", then +/- for stdin, /dev/stdin, - (as stdin), and <(…) process substitution, then a comment ("." = none).
fastqc 0.11.5 | . | "alapy_arc_1.1 -d test.fastq.ac - | fastqc stdin" | + + - + | recommended by authors
fastqc 0.11.5 | . | "alapy_arc_1.1 -d test.fastq.ac - | fastqc /dev/stdin" | + + - + | .
bwa 0.7.12-5 | mem | "alapy_arc_1.1 -d test.fastq.ac - | bwa mem hg19.fasta /dev/stdin > aln_se.sam" | - + + + | .
bwa 0.7.12-5 | mem | "alapy_arc_1.1 -d test.fastq.ac - | bwa mem hg19.fasta - > aln_se_1.sam" | - + + + | .
bwa 0.7.12-5 | mem (PE reads) | "bwa mem hg19.fasta <(alapy_arc_1.1 -d test_R1.fastq.ac -) <(alapy_arc_1.1 -d test_R2.fastq.ac -) > aln-pe2.sam" | - - - + | paired end
bwa 0.7.12-5 | aln | "alapy_arc_1.1 -d test.fastq.ac - | bwa aln hg19.fasta /dev/stdin > aln_sa.sai" | - + + + | .
bwa 0.7.12-5 | samse | "alapy_arc_1.1 -d test.fastq.ac - | bwa samse hg19.fasta aln_sa.sai /dev/stdin > aln-se.sam" | - + + + | .
bwa 0.7.12-5 | bwasw | "alapy_arc_1.1 -d SRR1769225.fastq.ac - | bwa bwasw hg19.fasta /dev/stdin > bwasw-se.sam" | - + + + | long reads testing
bowtie 1.1.2 | . | "alapy_arc_1.1 -d test.fastq.ac - | bowtie hs_grch37 /dev/stdin" | - + + + | .
bowtie2 2.2.6-2 | . | "alapy_arc_1.1 -d SRR1769225.fastq.ac - | bowtie2 -x hs_grch37 -U /dev/stdin -S output.sam" | - + + + | .
bowtie2 2.2.6-2 | (PE reads) | "bowtie2 -x ./hs_grch37 -1 <(alapy_arc_1.1 -d ERR1585276_1.fastq.ac -) -2 <(alapy_arc_1.1 -d ERR1585276_2.fastq.ac -) -S out_pe.sam" | - - - + | paired end
trimmomatic 0.35+dfsg-1 | . | "alapy_arc_1.1 -d test.fastq.ac - | java -jar trimmomatic.jar SE -phred33 /dev/stdin trimmomatic_out.fq.gz LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36" | - + - + | .
cutadapt 1.9.1 | . | "alapy_arc_1.1 -d test.fastq.ac - | cutadapt -a AACCGGTT -o cutadapt_output.fastq -" | - - + + | .
trimgalore 0.4.4 | . | "alapy_arc_1.1 -d test.fastq.ac - | trim_galore -a AACCGGTT -o trimgalore_output /dev/stdin" | - + + + | .
bbmap 37.23 | . | "alapy_arc_1.1 -d test.fastq.ac - | ./bbmap.sh ref=hg19.fasta in=stdin out=mapped.sam usemodulo=t" | + - - + | .
BENCHMARK
We tested the ALAPY Compressor on a diverse set of 230 public fastq files from NCBI SRA. You can read more about it on our website: http://alapy.com/services/alapy-compressor/
COMPRESSOR TESTING ON THE BENCHMARK
We observed compression ratios 1.5 to 3 times better than gzipped fastq files (fastq.gz) for the current version of the algorithm. In the figure, you can find compression results for several representative NGS experiments, including WES, WGS, RNA-seq, ChIP-seq and BS-seq on different HiSeqs, NextSeqs, Ion Torrent Proton, AB SOLiD 4 System and Helicos Heliscope, for human and mouse (both in the picture) as well as Arabidopsis, zebrafish, Medicago, yeasts and many other model organisms.
USAGE CASES
Our tool has been used on more than 2000 different fastq files, and in every case the md5 sum of the fastq file before compression and after decompression was exactly the same. We saved several TBs of space for more science. Hurray!
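You can reproduce this check with a round trip like the following (a sketch; file names are placeholders):
md5sum your_file.fastq                 # checksum of the original
alapy_arc -c your_file.fastq           # produces your_file.ac, original left intact
alapy_arc -d your_file.ac              # produces your_file.fastq.gz (a version suffix is added if the name is taken)
zcat your_file.fastq.gz | md5sum       # should match the original checksum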
FUTURE WORK
We are working on improving our algorithm, on support for other file formats, and on a version that makes a small change to the data and thereby allows a dramatic increase in compression ratio.
Please tell us what you think and how we can make it better.
Thank you,
Petr
Can you include uQ and Clumpify in the comparison, along with the other tools mentioned in the uQ thread?
Sure, genomax2. Thank you for your interest. We are planning to write a nice simple paper using our benchmark (which we will also improve a lot). We will try to provide tools for downloading and testing the benchmark to the community as well. We have already tested tons of tools, but on a much smaller "test" benchmark. I hope we will get a few more tools to test in this thread, or ideas about what and how we should test.
If you or anybody here on Biostars is interested in such a benchmark, in compression tools, or in studying this topic together with us, that is great! We can also write a paper together =)
Thank you (love Biostars and this community)
Hi Petr,
Can I request that you add compression support for stdout? It looks like currently decompression only supports stdout and compression only supports stdin, but it would be really convenient to have a "--stdout" flag like gzip/pigz to force all output to be written to stdout. Also, while it is possible to suppress the output messages with -q, for a tool whose primary output goes to stdout it might make sense to direct verbose messages to stderr instead so that they don't mix.
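For reference, this is the gzip pattern I mean (shown with gzip itself; a --stdout flag for alapy_arc is hypothetical at this point):
gzip -c your_file.fastq > your_file.fastq.gz    # -c (--stdout) sends compressed data to stdout; messages go to stderr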
Sure, thank you for your request, Brian. I will write here when the functionality you ask about is ready for testing and public use. Also, it appears that there is no direct messaging on Biostars. If you want, you can send an email to my personal mailbox, pon dot petr at gmail dot com, so I can notify you personally when we have stdout support for compression and stdin support for decompression, with verbose messages redirected to stderr.
Can you please also add xz to the comparison: xz -9, or maybe only xz -3, as it takes less time.
Thank you, Deepak Tanwar. We will include xz in the comparison as well. So far we just wanted to begin the conversation about NGS data compression with the Biostars community. If you or anybody here knows other tools you wish we had tested on our benchmark, please tell us. Thank you. Petr
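For reference, the two xz settings under discussion use standard xz flags (-k keeps the input file):
time xz -9 -k your_file.fastq    # highest preset: best ratio, slowest
time xz -3 -k your_file.fastq    # lower preset: faster, somewhat larger output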
Dear Deepak Tanwar
We have run a comparison of compression ratio, compression time and decompression time for xz. I know you are waiting for the -3 and -9 results. While a big reply with the benchmark explanation and software testing is in the works, we decided to start publishing some preliminary data. The results of the xz testing are very interesting for us, so here is a sneak peek at the same samples we used for the figure.
Results are as follows for xz compared with gzip, both with the -6 parameter:
In short, xz is slower than the ALAPY Compressor at compression and creates bigger files, but its decompression is much faster. Because of this discussion, we have started to think about the proper tradeoffs between decompression speed, memory usage, compression ratio and compression time.
So we wonder: how many of you use gzip with a parameter other than the default -6?
I actually use gzip with -2 for temp files and pigz with -4 for temp files or -8 or -9 for permanent files.
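In command form (standard gzip/pigz flags; pigz is a parallel gzip):
gzip -2 temp.fastq          # fast, lighter compression for temporary files
pigz -4 temp.fastq          # parallel, still fast
pigz -9 permanent.fastq     # best gzip-compatible ratio for permanent files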
Incidentally, I ran a test on 1 million pairs of 2x151bp E. coli reads with various compression types:
Alapy is the clear winner, compression-wise. I do worry a little about the memory consumption, though. It seems like it was not streaming the output to a file while running. Does it go to a temp file somewhere, or does everything stay in memory until it's done?
We are working on the memory concern and have improved memory usage in the new version 1.1: https://github.com/ALAPY/alapy_arc/tree/master/ALAPY%20Compressor%20v1.1 This version writes directly to stdout. Is its memory usage still way too high?
I don't really care about 1GB of overhead all that much, it's more a question of whether it uses a fixed amount of memory, or is input-dependent and might use, say, 100GB ram for a 100GB fastq file.
Memory usage is not input-dependent, so it will stay around 1-1.5GB even for bigger fastq files of around 100GB.
Currently, we do write temporary files to the hard drive during compression and are thinking about ways to avoid this.
And is it fast in compression and decompression? Does it allow random access?
[My knowledge of compression is fairly limited.]
These are very good questions. We gave our ALAPY Compressor to several research labs and commercial labs; they reported low CPU and memory usage and fast compression. We will test the time, CPU, memory and storage usage of different tools on a big and diverse benchmark. The current version is for archiving, so there is no random access yet, but in general our algorithm allows it, and this is one of the many things we are developing right now.
It looks like the latest version available on your website is still v1.1.4... it does not recognize the "-l" option.
Thank you, Brian. We updated the alapy.com website. It should work now.
Yes, indeed it works fine now, thanks!
First of all, thank you. The ALAPY compressor has helped our laboratory reduce the size of the sequencing reads produced by our HiSeq a lot. What I want to ask: could you please add an option to decompress many files with one command, like "alapy_arc -d *.ac" instead of "alapy_arc -d some_single_file.ac"? It would make your program handier to use.
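For now we use a shell loop as a workaround (a sketch, assuming bash and one FILE per invocation), but a built-in option would be nicer:
for f in *.ac; do
    alapy_arc -d "$f"
done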