Genome Compression: Why Not Just Use 7z on Stripped VCF?
11.1 years ago
daattali ▴ 50

Hi,

I'm new to genome compression. I was reading through this recent paper, and some of its results left me with an unanswered question (paper available here).

In Table 5, they show that simply stripping away all non-essential fields of a VCF file and then compressing it with 7z achieves excellent compression compared with other genome compression algorithms (1.7MB for a human genome). Why hasn't this approach been used if it's so simple?
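To make the approach concrete, here is a minimal sketch of "strip, then compress". It rests on assumptions that are not from the paper: the "essential" fields are taken to be CHROM, POS, REF and ALT, the function names and sample records are made up for illustration, and Python's `lzma` module stands in for 7z (it uses the same LZMA family of compression but writes an .xz stream, not a .7z archive).

```python
import lzma

# Column indices in a VCF data line: CHROM=0, POS=1, ID=2, REF=3, ALT=4, ...
# Assumption: only CHROM, POS, REF and ALT are treated as essential.
ESSENTIAL = (0, 1, 3, 4)

def strip_vcf(lines, keep=ESSENTIAL):
    """Yield only the kept columns of each data line, skipping header lines."""
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        yield "\t".join(fields[i] for i in keep) + "\n"

def compress_stripped(lines):
    """Strip the VCF lines and compress the result as a single LZMA stream."""
    data = "".join(strip_vcf(lines)).encode("ascii")
    return lzma.compress(data, preset=9)

# Made-up records, for illustration only.
sample_vcf = [
    "##fileformat=VCFv4.1\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "1\t10177\t.\tA\tAC\t100\tPASS\t.\n",
    "1\t10352\t.\tT\tTA\t100\tPASS\t.\n",
]

blob = compress_stripped(sample_vcf)
restored = lzma.decompress(blob).decode("ascii")
```

Note that the whole file becomes one compressed stream, so reading any single record means decompressing everything before it.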

11.1 years ago

because


...particularly the second point (though pragmatically life is much easier for computational biologists using linux :)


...the point being not bgzip and tabix specifically, but that random access is important so that, together with some kind of index, you can efficiently answer questions about a subset of your data -- e.g., show me the variants in some particular region of the genome.
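A toy sketch of that idea, assuming sorted records on a single chromosome: compress the records in small independent blocks and keep an index of each block's first position, so a region query only decompresses the blocks that can overlap it. This is not the actual bgzip or tabix format; `build_blocks`, `query` and the sample records are hypothetical names and data used only for illustration.

```python
import bisect
import zlib

def build_blocks(records, block_size=2):
    """Compress sorted (pos, text) records in independent blocks.

    Returns (blob, index), where index holds one
    (first_pos, offset, length) entry per block.
    """
    blob = bytearray()
    index = []
    for i in range(0, len(records), block_size):
        chunk = records[i:i + block_size]
        payload = "".join(f"{pos}\t{text}\n" for pos, text in chunk)
        comp = zlib.compress(payload.encode("ascii"))
        index.append((chunk[0][0], len(blob), len(comp)))
        blob += comp
    return bytes(blob), index

def query(blob, index, start, end):
    """Return (hits, blocks_touched) for records with start <= pos <= end."""
    first_positions = [entry[0] for entry in index]
    # Begin at the last block that starts strictly before `start`,
    # since it may still contain positions inside the region.
    i = max(bisect.bisect_left(first_positions, start) - 1, 0)
    hits, touched = [], 0
    while i < len(index) and index[i][0] <= end:
        _, offset, length = index[i]
        touched += 1
        text = zlib.decompress(blob[offset:offset + length]).decode("ascii")
        for line in text.splitlines():
            pos_str, rest = line.split("\t", 1)
            if start <= int(pos_str) <= end:
                hits.append((int(pos_str), rest))
        i += 1
    return hits, touched

# Made-up variant records, for illustration only.
records = [(100, "A>G"), (250, "C>T"), (400, "G>A"),
           (900, "T>C"), (1500, "A>T"), (2000, "G>C")]
blob, index = build_blocks(records)
hits, touched = query(blob, index, 1600, 2500)
```

Here the query for positions 1600 to 2500 decompresses only one of the three blocks. With a single solid archive such as the 7z output, the same query would have to decompress the stream from the beginning, which is the trade-off: independent blocks compress somewhat worse but allow indexed lookups.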


That makes sense. I didn't know that random access was such a high priority. Thanks.
