I've seen a few papers proposing algorithms to compress DNA, which as I understand is quite a challenge even for seasoned researchers in the area. Although some of those have been implemented (DNAcompress is one example I know of), I have no idea which of these algorithms/implementations are actually used by biologists or databases.
So, are any DNA compression method proposed in the computer science literature actually used in practice? And if so, what are the most popular ones?
(Please provide links to actual software or websites to support your answers)
ADD REPLY
• link
updated 5.2 years ago by
Ram
44k
•
written 13.4 years ago by
Ketil
4.1k
0
Entering edit mode
Since this just got bumped, I'll note that there was just a review paper published on this topic in AMB which covers this question from a practitioners point of view, and the pros and cons of various approaches.
Thansk for the link to Birney's blog, I didn't know that one. Although I'm skeptical, especially to the premise that saving space is particularly important.
The most common form of DNA compression is very simple. The basic idea is to just binary encode the DNA sequence. For example, you could use 2 bit for sequences that only contain [ATGC] or 4 bit for sequences that contain other alternative bases such as [ATGCYNR]. Just by binary encoding the sequence you can cut the file size down 75% with 2 bit encoding, and 50% with 4 bit encoding.
The advantage of this method is that the file can be easily parsed again without needing complicated compression algorithms. Using a binary encoded DNA sequence reduces the memory foot print of a large DNA sequence such as human's as well.
As a side note binary encoding DNA sequences is quite common. Most hash based indexing methods hash DNA sequences using their 4 bit / 2 bit encoding to save space, furthermore, they tend to store the reference DNA sequence as a binary encoded string. For example, NovoAlign does this.
Burrows-Wheeler Transform based aligners store the BWT DNA sequence as a 2 bit binary encoded sequence.
EDIT:
You could take this further by block compressing the binary encoded blocks using zlib but this will have a speed trade off if you are parsing the files frequently. Samtools uses a block compressiong scheme with zlib to compress binary encoded SAM files.
Sorry, but binary encoding of the DNA sequence using 2 bits per letter is NOT compression!!!!!
In order to have compression for a DNA sequence one should show that it uses strictly less than 2 bits per letter!!!
The definition of compression is "In computer science and information theory, data compression, source coding, or bit-rate reduction involves encoding information using fewer bits than the original representation" (from: http://en.wikipedia.org/wiki/Data_compression).
So in case of DNA sequence one has an alphabet of 4 letters which requires a 2 bits per letter when no compression is done. When one stores a DNA sequence in a plain text file then indeed the DNA sequence is stored using 8-bits (or more) per letter but this is plainly inefficient way of storing it (it is like storing a binary string in a plain text file)! the process of converting a DNA sequence which was stored in a plain text file (where 8-bit per character) to 2-bit per character is NOT compression because the original representation was 2-bit to begin with!
ADD REPLY
• link
updated 2.9 years ago by
Ram
44k
•
written 10.3 years ago by
enxxx23
▴
280
1
Entering edit mode
that is just 'semantics'. 2bit DNA is 'compressed' compared to a fasta file.
that is just 'semantics'. 2bit DNA is 'compressed' compared to a fasta file.
Sorry, by this is totally wrong! I suggest very gently to read some books and literature about compression!
FASTA file is not a valid reference for computing the compression ratio!!!!
All DNA compression algorithms have to use as reference the 2 bits per nucleotide! If they are able to compress the DNA sequence using less than 2 bits then we are having real compression anything else above 2 bits (like FASTA format) is just semantic and has nothing to do with compression!
ADD REPLY
• link
updated 5.1 years ago by
Ram
44k
•
written 9.5 years ago by
enxxx23
▴
280
0
Entering edit mode
Indeed, the only information that can't be compressed is totally random information which contains no predictable patterns at all. DNA contains many predictable patterns and repetitions. Even WinRAR should be able to compress DNA fairly well compared to task specific programs. winrar takes words like "this" "the" that repeat many times and marks them with a number. DNA also contains many times the same entry so WinRAR should be able to compress it very well.
Feature-wise the current version is equivalent to Markus' proof-of-concept code, but about an order of magnitude faster. Future versions will include better treatment of paired-end reads and the support of a variety of sequencing platforms, starting by SOLID but also others. Right now it's still very much in beta, so it's not officially ready for production.
My guess for the most commonly used compression would be gzip or bzip2, applied to fasta files.
Another common compression scheme is Burroughs-Wheeler compressed index as used for indexing for [?]most of the[?] maybe some short read aligners these days. Although the primary idea is indexing for fast searching, the compression means that the index is smaller than the original data. This is actually a variant of the suffix array, pioneered, I think, by Manzini and Ferragina.
And, as pointed out, another common scheme is binary encoding of nucleotides. This has its own format (TwoBit), and is used by BLAT and the UCSC browser, and also in various hashing schemes.
BWT aligners don't actually compress the genome, at least BowTie doesn't. It just stores a 2 bit encoded BWT transform, plus a bunch of suffix array indices and other data. The FM index, however, does compress the BWT and it's associated data in blocks
Interesting! I had just assumed they all used compression as well, since the BWT gets them half-way there. In my own experiments, compressed indices were a lot slower than plain suffix arrays, though. On the other hand, BWA appears to use the FM index, and refers to it from their paper (http://bioinformatics.oxfordjournals.org/content/26/5/589.full).
My understanding is that the BWT of a genome is likely to contain many long runs of a single character, and so is much more easily compressed (e.g. by RLE) than the original genome. But the BWT itself does not reduce the storage size of a genome
Not being able to trust raw sequencing data is a real problem for compression. If there are errors, it can be difficult to know what you can drop if anything. It also forces software to deal with alignment issues, which can really degrade the big O notation of genetically related search algorithms. That really brings up the question, what will it take to get a significantly higher accuracy in sequence data?
Once you have high accuracy data, you should be able to map larger and larger sequences to a library of known global identifiers. I'm learning so that may be what everyone means by leveraging a reference genome.
Another perspective:
Many people are moving towards "Cloud Storage". Storing lots of people's music has similar scaling issues. If cloud providers are forced into essentially storing each person's hard drive, they will get overwhelmed because they're responsible for everyone's unique bytes. But if they can match an entire song to a single global identifier and then have 100Ks people just reference that identifier, you can scale that much more efficiently.
I think what Birney & co mean, is that by assuming a shared, known reference genome, you only need to store and communicate the differences - think of it as a pre-shared LZ dictionary. Even if you store the ref along with the diffs, it can be a net win. The hard part is quality information, where you need to make some sacrifices.
They use this compression algorithm to get the distance measure (which then can be used to reconstruct the phylogeny) between genomes, chain letters, and plagiarized schoolwork. The method is simple:
Take 2 genomes
Compress the genomes individually to get sizes A and B (in bytes)
Compress the concatenation of the original genomes to get size C (in bytes)
Interestingly, there is a paper from 1999 with the title "Protein is incompressible". http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.16.1412&rep=rep1&type=pdf And http://www.dcc.uchile.cl/~gnavarro/ps/bibe10.pdf states directly that "In general, DNA and proteins are not compressible [...] using classical methods".
I experimented with various compression algorithms on Fastq data, and did a writeup at http://blog.malde.org/index.php/2011/06/22/compressing-biological-sequences/
Since this just got bumped, I'll note that there was just a review paper published on this topic in AMB which covers this question from a practitioners point of view, and the pros and cons of various approaches.
cited in "Analysis of Compression Techniques for DNA Sequence Data" https://www.researchgate.net/profile/Mir_Hassan4/publication/327719513_Analysis_of_Compression_Techniques_for_DNA_Sequence_Data/links/5ba08f9d45851574f7d272e3/Analysis-of-Compression-Techniques-for-DNA-Sequence-Data.pdf