Do you know of any initiatives for compressing NGS alignments?
The BAM format offers compression, but all aligned sequences and their qualities are still stored. Do you know of any reference-based compression? I think people at ENA are working on that. Have a look at CRAM.
What is your opinion about keeping qualities? Maybe using some quality thresholds is reasonable? Or storing qualities only for mismatches (and maybe for the +/- 3 bases around them)?
And what about sequence headers? Do we need to keep them at all in the alignment? Storing paired-end information should be enough in my opinion.
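To make the "qualities only around mismatches" idea concrete, here is a rough sketch, not taken from any existing tool. The function name `mask_qualities`, the `window` parameter and the `!` placeholder are all made up for illustration, and it assumes an ungapped alignment where the read and the reference segment have the same length.

```python
def mask_qualities(read, ref, quals, window=3, placeholder="!"):
    """Keep quality characters only within `window` bases of a mismatch.

    Assumes an ungapped alignment: read, ref and quals have equal length.
    Everything else is replaced by `placeholder` (hypothetical convention).
    """
    if not (len(read) == len(ref) == len(quals)):
        raise ValueError("read, ref and quals must have equal length")

    keep = [False] * len(read)
    for i, (read_base, ref_base) in enumerate(zip(read, ref)):
        if read_base != ref_base:
            # Mark the mismatch itself plus `window` bases on each side.
            start = max(0, i - window)
            end = min(len(read), i + window + 1)
            for j in range(start, end):
                keep[j] = True

    return "".join(q if k else placeholder for q, k in zip(quals, keep))


masked = mask_qualities("ACGTACGTAC", "ACGTTCGTAC", "ABCDEFGHIJ")
```

Long runs of the same placeholder compress very well, which is where the saving would come from. Note that this is lossy: the original qualities outside the window cannot be recovered.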
If it were just about saving space, you could get rid of the FASTQ input data once all sequences and qualities are stored in a BAM file; the FASTQ can then be regenerated from the BAM file. Discarding data is always a compromise, and in this case discarding qualities and sequences compromises re-analysis. Btw, your question is about compression, but also about discarding data, which is not the same thing as compression.
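As a rough illustration of why the FASTQ can be rebuilt from the alignment, here is a minimal sketch that turns one aligned record back into a FASTQ entry. The function `to_fastq` and its arguments are hypothetical, not a real library API. The one detail that matters is that reads aligned to the reverse strand are stored reverse-complemented in SAM/BAM, so they have to be flipped back.

```python
# Translation table for complementing bases (N stays N).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def to_fastq(name, seq, qual, is_reverse):
    """Rebuild a FASTQ entry from the fields of one aligned record.

    `is_reverse` stands for the reverse-strand flag of the alignment;
    such reads are stored reverse-complemented, so undo that here.
    """
    if is_reverse:
        seq = seq.translate(_COMPLEMENT)[::-1]
        qual = qual[::-1]
    return f"@{name}\n{seq}\n+\n{qual}\n"


entry = to_fastq("r1", "AACG", "ABCD", is_reverse=True)
```

This only works if the read name, full sequence and qualities are all kept in the alignment file, so dropping headers or qualities also removes the option of regenerating the FASTQ.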
Here is a paper describing a scheme for comparative compression of genomes from the same species. You might want to figure out first why you want to compress your data. If you can generate some of these metrics such as quality scores on the fly, then try compressing data after the last step that is computationally intensive.
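For the reference-based idea raised in the question, the basic principle can be sketched in a few lines: store only the position, the length and the bases that differ from the reference. This is a toy illustration under strong assumptions (ungapped alignment, no qualities), not the scheme from the paper and not how CRAM is actually implemented. The names `encode_read` and `decode_read` are made up.

```python
def encode_read(read, ref, pos):
    """Encode a read as (position, length, list of (offset, base) differences)."""
    segment = ref[pos:pos + len(read)]
    if len(segment) != len(read):
        raise ValueError("read does not fit on the reference at this position")
    diffs = [(i, base) for i, (base, ref_base) in enumerate(zip(read, segment))
             if base != ref_base]
    return pos, len(read), diffs


def decode_read(ref, pos, length, diffs):
    """Rebuild the read from the reference plus the stored differences."""
    bases = list(ref[pos:pos + length])
    for offset, base in diffs:
        bases[offset] = base
    return "".join(bases)


reference = "ACGTACGTACGT"
encoded = encode_read("TACCTA", reference, 3)
decoded = decode_read(reference, *encoded)
```

For reads that match the reference closely, the difference list is short or empty, so the sequence itself costs almost nothing to store. The decoder needs exactly the same reference, which is the main practical catch.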
Goby 2.0 is a major milestone for the Goby project, which brings state-of-the-art NGS alignment compression as well as very robust SAM/BAM import and export. See the new tutorial "What's new in Goby 2.0" for more information.
We created a summary table to compare features of Goby 1.x, 2.0, BAM, CRAM and FASTQ. Click here to see the full table.
Yes, I'm curious what your opinions are on lossless compression versus compression that discards some data (like sequence headers, some quals, etc.).