Hello Ewan et al,
Recently I have been asked several questions about formats and compression. Since I realized I did not know the answers well myself, I decided to run some experiments.
I am looking at the first 30 million reads from SRR065390, which is whole-genome resequencing data from C. elegans. I chose C. elegans, rather than a human chromosome, so that the effect of unmapped reads is visible. I aligned the paired-end reads with BWA and retained only the following tags: RG, AM, SM, NM and MD. The resulting BAM file is about 1.7GB.
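In case anyone wants to regenerate a similar input, this is roughly how the BAM was prepared (a sketch only: the exact bwa options, the 30M-read subsetting and the file names are reconstructed from memory, not copied from my shell history):

  # download the reads and align with bwa aln/sampe; subset to the first 30M reads (not shown)
  fastq-dump --split-files SRR065390
  bwa index celegans.fa
  bwa aln celegans.fa SRR065390_1.fastq > r1.sai
  bwa aln celegans.fa SRR065390_2.fastq > r2.sai
  bwa sampe -r '@RG\tID:SRR065390\tSM:CE' celegans.fa r1.sai r2.sai SRR065390_1.fastq SRR065390_2.fastq \
    | samtools view -bS - > aln_full.bam
  # keep only the RG/AM/SM/NM/MD optional tags (fields 12 and onwards are tags)
  samtools view -h aln_full.bam \
    | awk 'BEGIN{FS=OFS="\t"} /^@/{print;next}
           {s=$1; for(i=2;i<=11;i++) s=s OFS $i;
            for(i=12;i<=NF;i++) if($i~/^(RG|AM|SM|NM|MD):/) s=s OFS $i; print s}' \
    | samtools view -bS - > aln.bam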
I tried the following formats/programs: samtools, picard (1.63; java 1.6.0_25), BioHDF (0.4a), Goby (1.9.8.3.1), cramtools (~0.7, cloned a couple of days ago), gzip, bzip2 and pbzip2. I ran picard mainly to figure out how much time cramtools spends on BAM I/O. I am running everything on an old Linux server; I should probably also try a more recent machine in case the software/hardware configuration is atypical. Anyway, here are some crude numbers:
Id cpuTime(s) realTime(s) outputSize(bytes) BriefCommandLine
=========================================================================================
(1) 582 599 1766193610 samtools view -bS # SAM=>BAM
(2) 546 576 1766193610 samtools view -b # BAM=>BAM
(3) 39 39 N/A samtools index # indexing BAM (reading through)
(4) 497 600 1814906779 SamFormatConverter.jar I=aln.bam SILENT # BAM=>BAM
(5) 1001 1052 1003208626 cram --all-{qual,tags} --unmapped # BAM=>CRAM
(6) 2132 2088 1642777948 cramtools.jar bam # CRAM=>BAM
(7) 5793 8711 2814984166 bioh5g_import_alignments # SAM=>BioHDF; failed
(8) 2134 1987 317646256 goby.jar -m stc -p threshold=1 # SAM=>Goby
(9) 597 620 1644333705 gzip -c # SAM=>SAM.gz
(10) 70 89 N/A gzip -dc # SAM.gz=>SAM
(11) 2065 2076 1376445754 bzip2 -c # SAM=>SAM.bz2
(12) 372 373 N/A bzip2 -dc # SAM.bz2=>SAM
(13) 2059 256 1378699551 pbzip2 -cp8 # SAM=>SAM.bz2, 8 threads
(14) 413 50 N/A pbzip2 -dcp8 # SAM.bz2=>SAM, 8 threads
(15) 164 165 N/A samtools view # BAM=>SAM, 2.6GB exome BAM
(16) 55 56 N/A samtools index # indexing, 2.6GB exome BAM
(17) 1517 1540 N/A sam-dump -u # cSRA=>SAM, 1.8GB exome cSRA
(18) 140 141 N/A samtools depth > /dev/null # per-base depth
(19) 4903 4815 N/A mpileup -C50 -Euf|bcftools # SNP+INDEL calling
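In case anyone wants to reproduce the timings, a wrapper along these lines gives the CPU (user+sys) and wall-clock numbers (a sketch, not necessarily my exact invocation; aln.sam/aln.bam are placeholder names):

  /usr/bin/time -f 'user %U  sys %S  real %e' samtools view -bS aln.sam > aln.bam   # row (1)
  /usr/bin/time -f 'user %U  sys %S  real %e' samtools index aln.bam                # row (3)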
My interpretation of the results:
a) Cramtools first. I am running cramtools in a largely lossless mode, but with most read names converted to integers. This is why the BAM converted back from CRAM is smaller than the original [(6) vs. (1)]. The CRAM format is clearly superior to BAM in terms of file size: it is about 40% smaller on disk [=(1.64-1.00)/1.64, ref. (5) and (6)] in the lossless mode and will be much smaller in the lossy modes. The BAM=>CRAM conversion is also fast, only about twice as slow as a BAM=>BAM conversion [(4) vs. (5)]. My major concern, however, is the slow CRAM=>BAM conversion. Cramtools takes about 1635 sec [=2132-497, ref. (4) and (6)] to reconstruct the alignment information, excluding the time spent on BAM writing, while samtools takes only 39 sec [ref. (3)] to read through the entire BAM. Is this a reasonable way to measure the speed of CRAM decoding? Am I using cramtools in the best way, and is my result typical? If the timing is about right, is it possible to substantially optimize the CRAM=>BAM conversion? Although samtools SNP calling is slower than the CRAM=>BAM conversion [ref. (19)], simple tasks such as computing depth and collecting basic statistics will be far slower on CRAM than on BAM [ref. (18)], which is worrying. In addition, the most time-consuming step in SNP calling is BAQ, which may be replaced by better strategies and will naturally be deprecated by longer reads in the future.
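Concretely, the comparison I have in mind looks like this (a sketch: the cramtools and picard options are abbreviated with "..." rather than spelled out, celegans.fa is a placeholder, and subtracting picard's BAM=>BAM time only approximates the pure decoding cost):

  # BAM decode: read through the whole file, ~39 s in row (3)
  time samtools index aln.bam
  # CRAM decode: CRAM=>BAM minus picard BAM=>BAM, i.e. roughly 2132 - 497 = 1635 s
  time java -jar cramtools.jar bam ...                      # options as in row (6)
  time java -jar SamFormatConverter.jar I=aln.bam ...       # options as in row (4)
  # streaming tasks that would sit directly on top of the decoder, rows (18) and (19)
  time samtools depth aln.bam > /dev/null
  time samtools mpileup -C50 -Euf celegans.fa aln.bam | bcftools view -bvcg - > var.raw.bcf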
b) Then cSRA. BAM<=>cSRA conversion is now available in the SRA toolkit, but the documentation is sparse at the moment and I could not work out how to do the BAM=>cSRA conversion myself. So I downloaded existing human exome files from here:
ftp://ftp.ncbi.nih.gov/1000genomes/ftp/phase1/technical/ncbi_varpipe_data/compression_analysis/DATA/FIN/HG00311.mapped.illumina.mosaik.FIN.exome.20110411/no-quantized.{csra,bam}
to evaluate the decoding speed. cSRA is also superior to BAM in terms of file size (1.8GB vs 2.6GB), but it is also much slower at cSRA=>SAM conversion than samtools is at BAM=>SAM [ref. (15) and (17)].
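The decoding comparison itself just streams both files to SAM and discards the output (rows (15) and (17); the sam-dump option is copied from the table, and the file names are those in the FTP directory above):

  time samtools view no-quantized.bam > /dev/null    # BAM=>SAM, ~164 s CPU
  time sam-dump -u no-quantized.csra > /dev/null     # cSRA=>SAM, ~1517 s CPU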
c) BioHDF reports errors at the end of the conversion. Even this possibly incomplete file is larger than the BAM. Nonetheless, BioHDF files are editable, while BAM is static.
d) The Goby format. The Goby output is much smaller than the others, but from the Goby=>SAM output it appears that Goby discards base qualities, read names (and with them the pairing information), optional tags and also CIGARs. Goby may therefore be an efficient intermediate format for variant calling, but it does not serve the purpose of archiving. In addition, when doing the Goby=>SAM conversion, only a small fraction of the alignments are output; I do not know why. The generated Goby file is also rejected by IGV.
e) Compression algorithms. There have been quite a few threads on generic compression algorithms, and there is nothing new in my table: gzip is much faster than bzip2, which is why we chose zlib rather than libbz2 to compress BAM.
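For completeness, rows (9)-(14) are just the standard invocations on the uncompressed SAM (aln.sam is a placeholder name; the decompressed output is thrown away):

  time gzip   -c  aln.sam           > aln.sam.gz       # (9)
  time gzip   -dc aln.sam.gz        > /dev/null        # (10)
  time bzip2  -c  aln.sam           > aln.sam.bz2      # (11)
  time bzip2  -dc aln.sam.bz2       > /dev/null        # (12)
  time pbzip2 -c  -p8 aln.sam       > aln.sam.p.bz2    # (13), 8 threads
  time pbzip2 -dc -p8 aln.sam.p.bz2 > /dev/null        # (14), 8 threads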
I am mostly worried that I may have used some of the programs in the table incorrectly. I apologize if there are major errors and hope others can correct me.
If the table is about right, my summary is that the advantage of BAM is its relatively simple structure and fast decoding. Both cSRA and CRAM have the advantage of much smaller file sizes, which will become even more significant when lossy compression is enabled. At present, CRAM seems more mature than cSRA. Goby and BioHDF have unique features of their own, but they do not seem ready for general production use.
Looking forward, I think BAM will ultimately need a revamp, given the increasing read lengths and the accumulating flaws in its initial design. CRAM right now seems the best candidate as a replacement for BAM. On the other hand, in my opinion the first priority of CRAM development should be to optimize the decoding speed; the current decoding speed may be a concern for a large project like 1000g.
I have not decided what the best co-development model between cramtools and samtools would be. This needs more thought.
regards,
Heng
On Mar 8, 2012, at 12:43 PM, Ewan Birney wrote:
I am not sure that everyone here is on the cramtools mailing list, so just to say that CRAM 0.7 is out:
http://www.ebi.ac.uk/ena/about/cram_toolkit
CRAM is a reference-based compression toolkit which can shrink BAM files from 2-fold up to 50-fold, depending on the amount of tag information retained and the extent to which lossy compression is used.
CRAM 0.7 has the following functionality:
It always has lossless storage of bases and read-pair information, including long-range read pairs. It can store non-reference-based reads, but without significant compression gains (as these are already close to the optimum).
It stores read group and mapping quality information per read
It can store either lossless quality data or, if desired, lossily compressed qualities.
If a lower-entropy quality distribution is presented, it automatically compresses it better (i.e. if quality score binning is used, it will take less space).
For lossy modes beyond binning, there is a range of options, from pre-packaged lossy quality modes, such as storing only the qualities at positions that differ from the reference, up to totally arbitrary per-read-position quality storage, which gives the user complete control over the loss of precision.
It stores all the information in the BAM header during bam2cram conversion.
It supports arbitrary tags present in the BAM. The support for arbitrary tags in CRAM does not imply that a CRAM-based archive will store those tags - we would take "archive tags", currently those specified in the archive BAM specification.
It supports random access indexing to provide slices of the data, like BAM.
The cramtools library both acts as a command-line tool (like picard/samtools) and complies with most of the Picard read I/O interfaces, meaning that Java tools written against Picard can be given CRAM backends. We are working with the Picard developers to aim for even better plug-in behaviour.
In short, at this point we believe that CRAM is "feature complete". We know a variety of people have tried CRAM (many thanks to those who have), and that has extensively influenced our development path. We are sure there are still some bugs to be found and squashed, but at this stage it should be in a position to be extensively tested, with no major show-stopping bugs.
We would like CRAM to become an eventual replacement for BAM for archive submission and display, and we are also working with our other archive partners (NCBI) to ensure that the CRAM representations of both lossless and lossy compression can be round-tripped with the NCBI compression scheme (cSRA).
Kudos to Vadim, Rasko and Guy for putting this together, many thanks to the early adopters/testers (James B, Klaus, Tony C, others), and please do email the cram-dev list for more information.
Ewan
I would ask why organizations that are dealing with thousands of samples are still relying on individual files, rather than a warehouse where it is trivial to pull up all reads from a given region, for instance.
GATK developers would ask why we need such a warehouse when they have the single-sample calling pipeline. Illumina/CG people would ask why we need such a warehouse if they can make very good variant calls. I would ask why we need such a warehouse if we can derive good assemblies.
And those running labs would ask who is going to pay for such a warehouse.
And grandma would ask why she should put her money in a bank when she has a perfectly good mattress.