Why Is Gzip Compression Of Fasta Less Efficient Than 2Bit?
Nick Stoler ▴ 70 · 11.6 years ago

So in the past I've had reason to gzip large FASTA files, and I'd always noticed that the result seemed to be roughly 1/4 the original size, which is what I'd expect from the most obvious space optimization you can make: 2bit encoding. But then one day I actually calculated that 1/4 size and found that the gzipped file was actually a lot bigger. For instance, my copy of hg19.fa is 3011Mb, or 4*753Mb, but hg19.fa.gz is 894Mb (and hg19.fa.bz2 is 783Mb).

Now, I've learned the basics of LZ78, and even for that old algorithm, the first thing I'd expect it to do is replace A, G, C, and T with 00, 01, 10, and 11 in its code table. That's already 2bit encoding! From what I understand of gzip's DEFLATE, it uses Huffman coding, so I'd expect the same behavior. And then beyond that, I'd expect even more optimizations, of course, in low-complexity regions and the like.
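To make that expectation concrete, here's a quick sketch (plain textbook Huffman over just the four bases, assuming equal frequencies; this is not gzip's actual implementation) showing each base getting a 2-bit code:

    import heapq

    def huffman_code_lengths(freqs):
        """Return the Huffman code length (in bits) for each symbol."""
        # Heap entries: (subtree frequency, tiebreaker, {symbol: depth so far})
        heap = [(f, i, {s: 0}) for i, (s, f) in enumerate(freqs.items())]
        heapq.heapify(heap)
        tiebreak = len(heap)
        while len(heap) > 1:
            f1, _, a = heapq.heappop(heap)
            f2, _, b = heapq.heappop(heap)
            # Merging two subtrees pushes every symbol in them one level deeper.
            merged = {s: d + 1 for s, d in {**a, **b}.items()}
            heapq.heappush(heap, (f1 + f2, tiebreak, merged))
            tiebreak += 1
        return heap[0][2]

    # Four equally frequent bases: every one gets a 2-bit code.
    print(huffman_code_lengths({'A': 1, 'C': 1, 'G': 1, 'T': 1}))
    # -> {'A': 2, 'C': 2, 'G': 2, 'T': 2}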

So I'm wondering if anyone knows enough about these compression algorithms to give an explanation? Is it something simple, like that they don't use codes as small as 2 bits? On a side note, has anyone ever gzipped a 2bit file? What's the compression ratio like for that?

Why don't you try to gzip/bzip2 a 2bit file and tell us the resulting sizes? :o)

Hmm, well I didn't have any on hand and didn't know where to get one, but then I remembered seeing something about 2bit files on UCSC. So I grabbed hg19.2bit, which is 779Mb. Not surprising, since I know it includes header data to deal with things like N's and other non-ATGC data. And the gzipped version is 676Mb, only 90% of the theoretical 753Mb (i.e. 1/4 of the FASTA). So at least one thing works out like I expected.

hg19 also contains many N's, plus the sequence headers (e.g. ">chr1"). I don't know what effect those have on compression efficiency.

From what I gather about DEFLATE, the code table can adapt to local abundances of different strings, just like in LZ78. So the runs of millions of N's I've seen in hg19 should compress very efficiently, but once the encoder exits those regions, I'd expect it to quickly rebuild a code table appropriate to "normal" sequence; same for the stretches following header lines. This is what I'd expect, but as you can see, my expectations have not been correct, which is the perplexing part of this puzzle.

Actually, UCSC never uses 4 bits per base. That is not speculation but a settled fact; the Google results suggesting otherwise are misleading.

On bzip2 vs gzip: bzip2 is sometimes faster for sequence data (you can try it on chr11), while gzip really shines on decompression speed. On real data, the results fall between your two extremes. Real sequence data are almost always compressible, but often not in ways gzip/bzip2 can exploit, and picking the right method extends into active research areas. For genome compression, for example, gzip'ing a 2-bit encoded BWT can save you ~15% space compared to gzip'ing the genome sequence directly. For compressing resequencing reads, it is possible to compress 115Gbp of error-corrected human reads all the way down to 4GB, losslessly. For large-scale multi-sample genotype data, the right data structure can achieve a 0.02% compression ratio. These are not results a general-purpose compressor can achieve. There are long papers on compressing resequencing reads and on generic FASTQ, and there is even a bounty on sequence compression; you might be interested in those.
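A toy illustration of the BWT idea (a naive rotation-sort transform on a made-up sequence; real tools use suffix arrays, and this demonstrates the mechanism, not the exact ~15% figure):

    import random
    import zlib

    def bwt(s):
        """Naive Burrows-Wheeler transform via sorted rotations.
        O(n^2 log n): fine for a toy string, hopeless for a genome."""
        s += '\0'  # unique end-of-string sentinel
        rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
        return ''.join(r[-1] for r in rotations)

    # A made-up "genome": a motif repeated with occasional mutated copies.
    random.seed(0)
    motif = 'GATTACAGATTACA'
    seq = ''.join(motif if random.random() < 0.9 else
                  ''.join(random.choice('ACGT') for _ in motif)
                  for _ in range(300))

    # The BWT groups similar contexts together, producing long runs that
    # a general-purpose compressor can often exploit.
    print(len(zlib.compress(seq.encode(), 9)),
          len(zlib.compress(bwt(seq).encode(), 9)))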

I suspect that the greater entropy or "randomness" of most genomic sequence, together with a Huffman table that often needs more bits per character than 2bit does, is why gzip can perform worse than 2bit.

To explain, the 2bit format stores 4 bits for each base: two bits for the base and another two to classify the type of base (intron, exon or unknown). That is in addition to some minor overhead for metadata for the file and for each FASTA record.

The gzip DEFLATE procedure uses a mix of LZ77, which looks for and encodes repetitive regions, and Huffman coding, which builds a variable-bit-length code table over the LZ77 output.
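A quick way to see the two stages at work (a sketch; exact numbers will vary by zlib version and level):

    import random
    import zlib

    random.seed(42)
    n = 100_000
    random_dna = ''.join(random.choice('ACGT') for _ in range(n))
    repeat_dna = ('ACGTACGTTGCA' * (n // 12 + 1))[:n]

    for label, seq in (('random', random_dna), ('repetitive', repeat_dna)):
        out = zlib.compress(seq.encode(), 9)
        print(f'{label}: {8 * len(out) / n:.2f} bits/base')

On random DNA there is little for LZ77 to find, so the output hovers a bit above 2 bits/base (the Huffman floor plus overhead), while the repetitive input collapses to a small fraction of a bit per base thanks to LZ77 back-references.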

It seems reasonable to believe that 2bit would be more efficient for genomic regions containing "noisy" or "random" sequence data, which I'd wager is a good portion of a given genome. With 2bit, every base in a FASTA record's sequence gets compressed to the same four bits, regardless of the overall information content of the sequence being squeezed.

Some special areas of the genome can be redundant (transcription factor motifs, repeats like satellite DNA, etc.), and those will let LZ77 perform better, since it can squeeze such redundancies to a greater degree.

Once redundant data are compressed with LZ77, they are reduced further with a Huffman encoding table, which I think could often use more than four bits per entry, because there are at least ten base characters (actgACTGnN), the characters that appear in FASTA headers, and the special symbols the LZ77 encoding uses to describe the redundancies.

So, in sum, 2bit always uses four bits per base (plus minor overhead), whether the sequence is repetitive or not. Since there is probably less redundancy in most genomic data to take advantage of, the gzip procedure likely performs worse on typical, average inputs.

To answer your last question: it is difficult to compress a file with a series of two or more modern algorithms and expect better results. As you compress data, you remove exactly the redundancy that any subsequently applied algorithm would need; that's the definition of compression, after all!

Otherwise, you could just apply a magical compression algorithm to the same data repeatedly, in series, and shrink the input exponentially. With real-world compression algorithms, at some point you can't take any more squished stuff out and still be able to get your old, un-squished stuff back.

What you could do, I think, is take a 2bit representation of what you know in advance is highly redundant DNA sequence and squeeze that with a compression tool designed to take advantage of redundant input, like gzip or bzip2. You might be able to squeeze it a bit more.
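A sketch of that idea (the packing scheme here is improvised, not the UCSC .2bit layout, and it ignores padding of a trailing partial byte):

    import zlib

    CODE = {'A': 0, 'C': 1, 'G': 2, 'T': 3}

    def pack_2bit(seq):
        """Pack an ACGT-only string into 2 bits per base, 4 bases per byte."""
        out = bytearray()
        for i in range(0, len(seq), 4):
            byte = 0
            for base in seq[i:i + 4]:
                byte = (byte << 2) | CODE[base]
            out.append(byte)
        return bytes(out)

    seq = 'GATTACA' * 20_000  # deliberately redundant input
    packed = pack_2bit(seq)
    print(len(zlib.compress(seq.encode(), 9)),  # gzip on the raw text
          len(zlib.compress(packed, 9)))        # gzip on the 2-bit packing

For redundant input both end up tiny, but the packed stream starts out 4x smaller before gzip even runs.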

Hrm, 2bit uses 2 bits per base; I don't know where you got the extra 2 bits from? https://genome.ucsc.edu/FAQ/FAQformat.html#format7

I read a different page that described a specification for 2bit, which indicated a two-bit representation for bases, padded to a 32-bit word, plus an equivalent-length string for the mask. I probably should have read the UCSC spec instead: http://jcomeau.freeshell.org/www/genome/2bitformat.html

Googling "2bit ucsc" gives me that page, which claims the blog post gives the specification. This is wrong: the blog is describing the blogger's own format, which has nothing to do with the widely used UCSC 2bit format. For repeat-masked sequences, the UCSC format is far more space efficient.

Yes. Also, judging from the output file size, the format really does use ~2 bits per base. And there is no intron/exon information in 2bit.

Oh wow, I did forget about the issue of lowercase characters. A simple grep -c shows that fully 52% of the lines in hg19.fa contain '[agtc]' (a sketch of that check follows at the end of this reply). That could be it. Edit: Also, duh, newlines. It seems they're on 100% of lines!
Though I am surprised to learn how deceptively the 2bit format is named. Also, as I mentioned in my reply to martinahansen, I got a copy of hg19.2bit, and it's pretty close to 1/4 the size of the FASTA version. Are there versions that forgo the annotation and the extra 2 bits per base?
Oh, and it turns out hg19.2bit gzips to an 87% ratio. I thought it might not be fruitless, since 2bit "compression" is far from complete, and gzip could still take advantage of any repetition in the actual sequence itself. But no, I can't say it's a very practical idea!
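For the record, the lowercase check as a Python sketch (equivalent to the grep; the path is whatever your copy is called):

    import re

    lower = total = 0
    with open('hg19.fa') as fa:  # path to your local copy
        for line in fa:
            total += 1
            if re.search('[acgt]', line):
                lower += 1
    print(f'{100 * lower / total:.0f}% of lines contain lowercase bases')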

No, the presence of lowercase letters and newlines does not explain why 2bit is better. You can convert all bases to uppercase and remove all newlines and FASTA headers, so that the whole file contains only uppercase A/C/G/T/N; you will find that 2bit still does better. My guess is that we need to encode additional symbols in the Huffman table (e.g. stream end and reset), which takes additional bits. I could be wrong, though.
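The normalization experiment looks something like this (a sketch; I'm using a single chromosome as a stand-in so the whole sequence fits in memory, and the file name is a placeholder):

    import zlib

    # Drop headers and newlines and uppercase everything, leaving a
    # stream of only A/C/G/T/N.
    with open('chr21.fa') as fa:
        seq = ''.join(line.strip().upper()
                      for line in fa if not line.startswith('>'))

    out = zlib.compress(seq.encode(), 9)
    print(f'gzip: {8 * len(out) / len(seq):.3f} bits/base (2bit floor: 2.000)')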

Hmm, yep, it looks like even when you uppercase everything and remove the newlines, 2bit still slightly beats gzip.
But I'm no longer sure that any of these extra symbols (lowercase, newlines, stream end/reset, or even N's) can explain it. All of them should be far less common than ATGC in the vast majority of blocks, so ATGC should still get the shortest Huffman codes (2 bits or even fewer). The only explanation I can think of is that, for some reason, gzip doesn't use ultra-short codes like 1- and 2-bit ones.

I have asked James Bonfield, arguably the best at bioinfo-related data compression. He has confirmed my speculation: in addition to a/c/g/t, gzip always needs to encode the end-of-block symbol. With 5 symbols, one of a/c/g/t must take a 3-bit code. If you compress a string with only 3 distinct symbols, you will find gzip works much better, because in that case it can use a 2-bit code for every symbol, including end-of-block.
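You can watch that happen empirically (a sketch; exact figures will wobble with the zlib version, but the 4-letter input should land near the 2.25 bits/symbol that a {2,2,2,3}-bit code predicts, while the 3-letter input lands near 2.0):

    import random
    import zlib

    random.seed(1)
    n = 1_000_000
    for alphabet in ('ACGT', 'ACG'):
        s = ''.join(random.choice(alphabet) for _ in range(n)).encode()
        out = zlib.compress(s, 9)
        print(f'{alphabet}: {8 * len(out) / n:.3f} bits/symbol')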

Asaf 10k · 11.6 years ago

The compression code has to be a prefix code, i.e. the 4 nts would be coded as something like 0, 10, 110, 1110 to leave room for other, less frequent characters (like \n, N, etc.), so the sequence would be coded at (1+2+3+4)/4 = 2.5 bits per base on average. Even if the actual coding differs from my simple example, it has to average more than 2 bits per base.
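The arithmetic, spelled out (alongside the {2,2,2,3} code per the James Bonfield comment above):

    # Average bits/base for the example prefix code 0, 10, 110, 1110,
    # assuming the four bases are equally frequent:
    print(sum([1, 2, 3, 4]) / 4)  # -> 2.5

    # The code gzip settles on in practice is closer to {2, 2, 2, 3}:
    print(sum([2, 2, 2, 3]) / 4)  # -> 2.25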
