Leaving aside the interesting question of whether 2bit really stores two (unmasked) or four (masked) bits per base, I wanted to test my hypothesis that FASTA files containing repetitive sequence data compress better than FASTA files whose bases carry maximum Shannon entropy (uncertainty).
Regardless of the masking issue, I think the other question is testable with some scripting and use of Jim Kent's toolkit.
Let's first make a FASTA file called fixed_sequences_set.fa that contains elements whose sequences consist of one repeated base, i.e. the least possible Shannon entropy per base.
We are comparing this file with randomized four-base data, so to control for the specific base being picked as a factor in the algorithm's compression efficiency, we pick the repeated base with equal probability for each FASTA element. This helps ensure that both files are made up of the same overall "alphabet":
#!/usr/bin/env perl

use strict;
use warnings;

my @bases = qw(A C T G);
my @freqs = qw(0.25 0.25 0.25 0.25); # probabilities should sum to 1
my @cumFreqs;
my $seqBases = 5000;
my $seqCount = 100;
my $headerPrefix = "seq_";

buildCumFreqs();

foreach my $seqIdx (1..$seqCount) {
    print ">$headerPrefix$seqIdx\n";
    # pick one base at random and repeat it for the entire record
    my $base = generateBase();
    foreach my $baseIdx (1..$seqBases) {
        print $base;
    }
    print "\n";
}

# draw a base according to the cumulative frequency table
sub generateBase {
    my $rand = rand();
    foreach my $freqIdx (0..$#cumFreqs) {
        if ($rand <= $cumFreqs[$freqIdx]) {
            return $bases[$freqIdx];
        }
    }
    die "you should never see this!";
}

# turn @freqs into a cumulative distribution, e.g. (0.25, 0.5, 0.75, 1.0)
sub buildCumFreqs {
    my $cumFreq = 0;
    foreach my $freq (@freqs) {
        $cumFreq += $freq;
        push (@cumFreqs, $cumFreq);
    }
    return 0;
}
We'll make the FASTA file like so:
$ ./make_fixed_fasta.pl > fixed_sequences_set.fa
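As a quick sanity check, counting header lines with grep confirms that the file contains the expected one hundred records:

$ grep -c '^>' fixed_sequences_set.fa
100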
Here's what it looks like, in general:
>seq_1
TTTTTTTTTTT...TTT
>seq_2
TTTTTTTTTTT...TTT
>seq_3
AAAAAAAAA...AAA
...
>seq_100
CCCCCCCC...CCC
Next, we'll make a FASTA file called random_sequences_set.fa that contains elements where bases are picked with maximum entropy or uncertainty, i.e., a probability of 0.25 for each of the bases A, C, T and G:
#!/usr/bin/env perl

use strict;
use warnings;

my @bases = qw(A C T G);
my @freqs = qw(0.25 0.25 0.25 0.25); # probabilities should sum to 1
my @cumFreqs;
my $seqBases = 5000;
my $seqCount = 100;
my $headerPrefix = "seq_";

buildCumFreqs();

foreach my $seqIdx (1..$seqCount) {
    print ">$headerPrefix$seqIdx\n";
    # draw every base independently, so each record is fully random
    foreach my $baseIdx (1..$seqBases) {
        print generateBase();
    }
    print "\n";
}

# draw a base according to the cumulative frequency table
sub generateBase {
    my $rand = rand();
    foreach my $freqIdx (0..$#cumFreqs) {
        if ($rand <= $cumFreqs[$freqIdx]) {
            return $bases[$freqIdx];
        }
    }
    die "you should never see this!";
}

# turn @freqs into a cumulative distribution, e.g. (0.25, 0.5, 0.75, 1.0)
sub buildCumFreqs {
    my $cumFreq = 0;
    foreach my $freq (@freqs) {
        $cumFreq += $freq;
        push (@cumFreqs, $cumFreq);
    }
    return 0;
}
We'll make the randomized FASTA file like so:
$ ./make_random_fasta.pl > random_sequences_set.fa
Here's what this looks like:
>seq_1
CTCGTCCCACAAGTCTT...TG
...
>seq_100
CGGGGTAAAGCGCACC...GC
Let's compare the file sizes of the uncompressed FASTA files:
$ ls -al *.fa
-rw-r--r-- 1 alexpreynolds staff 500892 Apr 23 16:54 fixed_sequences_set.fa
-rw-r--r-- 1 alexpreynolds staff 500892 Apr 23 16:54 random_sequences_set.fa
Both are the same size, roughly 500 kilobytes.
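That figure is exactly what we'd expect from the script parameters: each record is a header line of 7 to 9 bytes (depending on the number of digits in the index) plus a 5,001-byte sequence line (5,000 bases and a newline). A quick back-of-the-envelope check with Perl, which is just an aside and not part of the scripts above:

$ perl -e '$t = 0; foreach $i (1..100) { $t += length(">seq_$i\n") + 5000 + 1 } print "$t\n"'
500892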
Now let's make them into 2bit files with the faToTwoBit tool, which is part of the UCSC Kent toolkit, and again compare their file sizes:
$ faToTwoBit fixed_sequences_set.fa fixed_sequences_set.fa.2bit
$ faToTwoBit random_sequences_set.fa random_sequences_set.fa.2bit
$ ll *.2bit
-rw-r--r-- 1 alexpreynolds staff 127708 Apr 23 16:54 fixed_sequences_set.fa.2bit
-rw-r--r-- 1 alexpreynolds staff 127708 Apr 23 16:55 random_sequences_set.fa.2bit
When made into 2bit files, both are still the same file size. Both files have the same number of bases in each element's sequence, both files have the same element headers, and there is no masking or other manipulation of any bases.
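The 2bit sizes also make sense if you do the arithmetic by hand. Packing 5,000 bases at four bases per byte gives 1,250 bytes per record, or 125,000 bytes of sequence data across 100 records; going by my reading of the UCSC 2bit specification, the remaining 2,708 bytes are the 16-byte file header, the index of sequence names and offsets, and 16 bytes of per-record metadata for the (empty) N-block and mask-block lists:

$ perl -e '$size = 16; foreach $i (1..100) { $size += 1 + length("seq_$i") + 4 + 16 + 5000/4 } print "$size\n"'
127708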
Let's now compress each 2bit file with gzip:
$ gzip -c fixed_sequences_set.fa.2bit > fixed_sequences_set.fa.2bit.gz
$ gzip -c random_sequences_set.fa.2bit > random_sequences_set.fa.2bit.gz
The file size of the compressed 2bit file made of fixed-base sequences is roughly 1 kB:
$ ls -al fixed_sequences_set.fa.2bit.gz
-rw-r--r-- 1 alexpreynolds staff 1001 Apr 23 17:15 fixed_sequences_set.fa.2bit.gz
The file size of the compressed 2bit file made of purely random bases is roughly 126 kB, nearly identical to the size of the original 2bit file:
$ ls -al random_sequences_set.fa.2bit.gz
-rw-r--r-- 1 alexpreynolds staff 126028 Apr 23 17:16 random_sequences_set.fa.2bit.gz
The compression efficiency of gzip applied to the 2bit file built from redundant sequence data is the ratio of the compressed file size to the uncompressed file size, or 1001/127708 = 0.008. Pretty decent!
The compression efficiency of gzip applied to the 2bit file built from random sequence data is 126028/127708 = 0.986. The use of gzip here doesn't help very much.
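If you'd rather not do the division by hand, a quick awk one-liner gives the same ratios:

$ awk 'BEGIN { printf "%.4f %.4f\n", 1001/127708, 126028/127708 }'
0.0078 0.9868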
If you use bzip2, you'll find similar results: that algorithm is also helped by redundancy in the data. In real-world usage, bzip2 will tend to have slightly greater efficiency than gzip, at the expense of greater compression time.
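For reference, the equivalent bzip2 commands would look like the following; the exact sizes will depend on your run, but the pattern should mirror what we saw with gzip:

$ bzip2 -c fixed_sequences_set.fa.2bit > fixed_sequences_set.fa.2bit.bz2
$ bzip2 -c random_sequences_set.fa.2bit > random_sequences_set.fa.2bit.bz2
$ ls -al *.bz2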
Knowing the makeup of the data can help guide the choice of algorithm for the task, or suggest when not to do extra, unnecessary work that yields diminishing returns.
Why don't you try to gzip/bzip a 2bit file and tell us the resulting sizes :o)
Hmm, well I didn't have any on hand and didn't know where to get one, but then I remembered seeing something about 2bit files on UCSC. So I grabbed hg19.2bit, which is 779 MB. Not surprising, since I know it includes header data to deal with things like N's and other non-ATGC data. And the gzipped version is 676 MB, only about 90% of the theoretical 753 MB quarter-size FASTA file. So at least one thing works out like I expected.
hg19 also contains many "N" bases, plus the sequence headers (e.g. ">chr1"). I don't know what effect they have on compression efficiency.
From what I gather about DEFLATE, it seems the code table can adapt to local abundances of different strings, just like LZ78. So the runs of millions of N's I've seen in hg19 should compress very efficiently, but once it exits those regions, I'd expect it to quickly rebuild a code table appropriate to "normal" sequence. Same for areas following header lines. This is what I'd expect, but as you can see, my expectations have not been correct, which is the perplexing part of this puzzle.
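As a rough, stand-alone illustration of the first part of that intuition (a toy test rather than a measurement on hg19 itself), a megabyte-long run of N's piped through gzip should come out on the order of a kilobyte:

$ perl -e 'print "N" x 1000000' | gzip -c | wc -c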