Leaving aside the interesting question of whether 2bit really stores two (unmasked) or four (masked) bits per base, I wanted to test my hypothesis that FASTA files containing repetitive sequence data compress better than FASTA files whose bases carry maximum Shannon entropy (uncertainty).
Regardless of the masking issue, I think the other question is testable with some scripting and use of Jim Kent's toolkit.
Let's first make a FASTA file called fixed_sequences_set.fa that contains elements whose sequences consist of one repeated base, i.e. the least possible Shannon entropy per base.
We are comparing this file with randomized four-base data, so to control for the specific base being picked as a factor in the algorithm's compression efficiency, we pick the repeated base with equal probability for each FASTA element. This helps ensure that both files are made up of the same overall "alphabet":
#!/usr/bin/env perl

use strict;
use warnings;

my @bases = qw(A C T G);
my @freqs = qw(0.25 0.25 0.25 0.25); # probabilities should sum to 1
my @cumFreqs;
my $seqBases = 5000;
my $seqCount = 100;
my $headerPrefix = "seq_";

buildCumFreqs();

foreach my $seqIdx (1..$seqCount) {
    print ">$headerPrefix$seqIdx\n";
    # pick one base at random and repeat it for the entire record
    my $base = generateBase();
    foreach my $baseIdx (1..$seqBases) {
        print $base;
    }
    print "\n";
}

# draw a base according to the cumulative frequency table
sub generateBase {
    my $rand = rand();
    foreach my $freqIdx (0..$#cumFreqs) {
        if ($rand <= $cumFreqs[$freqIdx]) {
            return $bases[$freqIdx];
        }
    }
    die "you should never see this!";
}

# turn @freqs into a cumulative distribution, e.g. (0.25, 0.5, 0.75, 1.0)
sub buildCumFreqs {
    my $cumFreq = 0;
    foreach my $freq (@freqs) {
        $cumFreq += $freq;
        push (@cumFreqs, $cumFreq);
    }
    return 0;
}
We'll make the FASTA file like so:
$ ./make_fixed_fasta.pl > fixed_sequences_set.fa
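As a quick sanity check, counting header lines with grep confirms that the file contains the expected one hundred records:

$ grep -c '^>' fixed_sequences_set.fa
100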
Here's what it looks like, in general:
>seq_1
TTTTTTTTTTT...TTT
>seq_2
TTTTTTTTTTT...TTT
>seq_3
AAAAAAAAA...AAA
...
>seq_100
CCCCCCCC...CCC
Next, we'll make a FASTA file called random_sequences_set.fa that contains elements where bases are picked with maximum entropy or uncertainty, i.e., a probability of 0.25 for each of the bases A, C, T and G:
#!/usr/bin/env perl

use strict;
use warnings;

my @bases = qw(A C T G);
my @freqs = qw(0.25 0.25 0.25 0.25); # probabilities should sum to 1
my @cumFreqs;
my $seqBases = 5000;
my $seqCount = 100;
my $headerPrefix = "seq_";

buildCumFreqs();

foreach my $seqIdx (1..$seqCount) {
    print ">$headerPrefix$seqIdx\n";
    # draw every base independently, so each record is fully random
    foreach my $baseIdx (1..$seqBases) {
        print generateBase();
    }
    print "\n";
}

# draw a base according to the cumulative frequency table
sub generateBase {
    my $rand = rand();
    foreach my $freqIdx (0..$#cumFreqs) {
        if ($rand <= $cumFreqs[$freqIdx]) {
            return $bases[$freqIdx];
        }
    }
    die "you should never see this!";
}

# turn @freqs into a cumulative distribution, e.g. (0.25, 0.5, 0.75, 1.0)
sub buildCumFreqs {
    my $cumFreq = 0;
    foreach my $freq (@freqs) {
        $cumFreq += $freq;
        push (@cumFreqs, $cumFreq);
    }
    return 0;
}
We'll make the randomized FASTA file like so:
$ ./make_random_fasta.pl > random_sequences_set.fa
Here's what this looks like:
>seq_1
CTCGTCCCACAAGTCTT...TG
...
>seq_100
CGGGGTAAAGCGCACC...GC
Let's compare the file sizes of the uncompressed FASTA files:
$ ls -al *.fa
-rw-r--r-- 1 alexpreynolds staff 500892 Apr 23 16:54 fixed_sequences_set.fa
-rw-r--r-- 1 alexpreynolds staff 500892 Apr 23 16:54 random_sequences_set.fa
Both are the same size, roughly 500 kilobytes.
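That figure is exactly what we'd expect from the script parameters: each record is a header line of 7 to 9 bytes (depending on the number of digits in the index) plus a 5,001-byte sequence line (5,000 bases and a newline). A quick back-of-the-envelope check with Perl, which is just an aside and not part of the scripts above:

$ perl -e '$t = 0; foreach $i (1..100) { $t += length(">seq_$i\n") + 5000 + 1 } print "$t\n"'
500892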
Now let's make them into 2bit files with the faToTwoBit tool, which is part of the UCSC Kent toolkit, and again compare their file sizes:
$ faToTwoBit fixed_sequences_set.fa fixed_sequences_set.fa.2bit
$ faToTwoBit random_sequences_set.fa random_sequences_set.fa.2bit
$ ll *.2bit
-rw-r--r-- 1 alexpreynolds staff 127708 Apr 23 16:54 fixed_sequences_set.fa.2bit
-rw-r--r-- 1 alexpreynolds staff 127708 Apr 23 16:55 random_sequences_set.fa.2bit
When made into 2bit files, both are still the same file size. Both files have the same number of bases in each element's sequence, both files have the same element headers, and there is no masking or other manipulation of any bases.
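The 2bit sizes also make sense if you do the arithmetic by hand. Packing 5,000 bases at four bases per byte gives 1,250 bytes per record, or 125,000 bytes of sequence data across 100 records; going by my reading of the UCSC 2bit specification, the remaining 2,708 bytes are the 16-byte file header, the index of sequence names and offsets, and 16 bytes of per-record metadata for the (empty) N-block and mask-block lists:

$ perl -e '$size = 16; foreach $i (1..100) { $size += 1 + length("seq_$i") + 4 + 16 + 5000/4 } print "$size\n"'
127708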
Let's now compress each 2bit file with gzip:
$ gzip -c fixed_sequences_set.fa.2bit > fixed_sequences_set.fa.2bit.gz
$ gzip -c random_sequences_set.fa.2bit > random_sequences_set.fa.2bit.gz
The file size of the compressed 2bit file made of fixed-base sequences is roughly 1 kB:
$ ls -al fixed_sequences_set.fa.2bit.gz
-rw-r--r-- 1 alexpreynolds staff 1001 Apr 23 17:15 fixed_sequences_set.fa.2bit.gz
The file size of the compressed 2bit file made of purely random bases is roughly 126 kB, nearly identical to the size of the original 2bit file:
$ ls -al random_sequences_set.fa.2bit.gz
-rw-r--r-- 1 alexpreynolds staff 126028 Apr 23 17:16 random_sequences_set.fa.2bit.gz
The compression efficiency of gzip applied to the 2bit file built from redundant sequence data is the ratio of the compressed file size to the uncompressed file size, or 1001/127708 = 0.008. Pretty decent!
The compression efficiency of gzip applied to the 2bit file built from random sequence data is 126028/127708 = 0.986. The use of gzip here doesn't help very much.
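If you'd rather not do the division by hand, a quick awk one-liner gives the same ratios:

$ awk 'BEGIN { printf "%.4f %.4f\n", 1001/127708, 126028/127708 }'
0.0078 0.9868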
If you use bzip2, you'll find similar results: that algorithm is also helped by redundancy in the data. In real-world usage, bzip2 will tend to have slightly greater efficiency than gzip, at the expense of greater compression time.
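For reference, the equivalent bzip2 commands would look like the following; the exact sizes will depend on your run, but the pattern should mirror what we saw with gzip:

$ bzip2 -c fixed_sequences_set.fa.2bit > fixed_sequences_set.fa.2bit.bz2
$ bzip2 -c random_sequences_set.fa.2bit > random_sequences_set.fa.2bit.bz2
$ ls -al *.bz2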
Knowing the makeup of the data can help guide the choice of algorithm for the task, or suggest when not to do extra, unnecessary work that yields diminishing returns.
Why don't you try to gzip/bzip a 2bit file and tell us the resulting sizes :o)
Hmm, well I didn't have any on hand and didn't know where to get one, but then I remembered seeing something about 2bit files on UCSC. So I grabbed hg19.2bit, which is 779 MB. Not surprising, since I know it includes header data to deal with things like N's and other non-ATGC data. And the gzipped version is 676 MB, only about 90% of the theoretical 753 MB quarter-size FASTA file. So at least one thing works out like I expected.
hg19 also contains many "N" bases, plus the sequence headers (e.g. ">chr1"). I don't know what effect they have on compression efficiency.
From what I gather about DEFLATE, it seems the code table can adapt to local abundances of different strings, just like LZ78. So the runs of millions of N's I've seen in hg19 should compress very efficiently, but once it exits those regions, I'd expect it to quickly rebuild a code table appropriate to "normal" sequence. Same for areas following header lines. This is what I'd expect, but as you can see, my expectations have not been correct, which is the perplexing part of this puzzle.
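As a rough, stand-alone illustration of the first part of that intuition (a toy test rather than a measurement on hg19 itself), a megabyte-long run of N's piped through gzip should come out on the order of a kilobyte:

$ perl -e 'print "N" x 1000000' | gzip -c | wc -c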