Why are we still using BAM files, and not CRAM, HDF5, or improved BAM formats?
29 • 10.8 years ago • WilliamS ▴ 320

The BAM format was developed at a time when I/O (disk/network) was not a limiting factor for genomics data analysis. Today it is, especially with pipelines like GATK that process BAM files in multiple steps and read and write a BAM file at every step (instead of streaming in memory).
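
For illustration, here is a minimal sketch of the streaming alternative: chaining two samtools steps through a pipe from Python, so no intermediate BAM is ever written to disk. The file names and the mapping-quality threshold are placeholders.

    import subprocess

    # Filter by mapping quality, then sort, streaming BAM between the two steps.
    # "input.bam", "filtered.sorted.bam" and the -q 20 cutoff are placeholders.
    view = subprocess.Popen(
        ["samtools", "view", "-b", "-q", "20", "input.bam"],   # filtered BAM on stdout
        stdout=subprocess.PIPE,
    )
    sort = subprocess.Popen(
        ["samtools", "sort", "-o", "filtered.sorted.bam", "-"],  # read BAM from stdin
        stdin=view.stdout,
    )
    view.stdout.close()  # let the first step get SIGPIPE if the second exits early
    sort.communicate()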

I know that raw sequencing data (SOLiD & PacBio) and other large (scientific, financial, etc.) datasets are stored in HDF5, and that there are initiatives like CRAM and reduced BAM.

Still, I believe most organizations are using regular BAM files. Why is this? Is there a roadmap towards the development (and adoption!) of improved formats for storing DNA alignment data?

cram bam io hdf • 35k views

0

I would ask why organizations that are dealing with thousands of samples are still relying on individual files, rather than a warehouse where it is trivial to pull up all reads from a given region, for instance.

2

GATK developers would ask why we need such a warehouse when they have the single-sample calling pipeline. Illumina/CG people would ask why we need such a warehouse if they can make very good variant calls. I would ask why we need such a warehouse if we can derive good assemblies.

1

And those running labs would ask who is going to pay for such a warehouse.

10

And grandma would ask why she should put her money in a bank when she has a perfectly good mattress.

28 • 10.8 years ago • lh3 33k

Firstly, ReducedBAM is still BAM, and it has been discontinued. Secondly, not streaming BAM is a problem of GATK, not of BAM; a lot of other tools take advantage of streaming, which SAM/BAM encourages by design. Thirdly, I think CRAM will ultimately replace BAM. CRAM has all the advantages of BAM (except that it is more complex) plus smaller file size and other additional features. With the official SAM/VCF/tabix libraries, htslib in C and htsjdk in Java, seamlessly supporting both, users will be able to use CRAM wherever BAM is applicable.

IMO, the single top reason that other formats (Goby, cSRA, ADAM and BioHDF) did not fly is that they were a bit late to the party. They all have their own advantages, but when most data are already in the BAM format and the vast majority of downstream tools accept BAM only, there is not much room for a new format unless it is substantially better than BAM - not so easy ;-) CRAM wins out for now mainly because it is so close to BAM conceptually that it is easy for the two to share the same APIs.

Other minor reasons include: a) processing BAM was significantly faster, though CRAM is catching up now. b) BAM is relatively simple and was fully spec'd from the very beginning. As a result, it has been implemented independently in a variety of programming languages (C, C++ (multiple times), Java, Lisp, Go, D, Haskell, JavaScript, ...). This greatly helps its adoption. c) The following may seem a weird reason - SAM/BAM was first implemented in C with admittedly bad but still usable APIs. For a scripting language, binding to a C library is usually easier than binding to a library in C++, Java, etc. Perl, Python and R bindings played an important role in the early adoption of BAM. After all, most analysts work with scripting languages.
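
To make point (c) concrete, here is a minimal sketch of what such a binding looks like in practice, assuming pysam (the Python wrapper around htslib) is installed. The file names and the reference FASTA are placeholders; the point is that the same calls work whether the input is BAM or CRAM.

    import pysam

    # The same API reads BAM ("rb") or CRAM ("rc"); CRAM additionally needs the
    # reference it was compressed against. Paths are placeholders.
    with pysam.AlignmentFile("aln.bam", "rb") as bam:
        mapped = sum(1 for read in bam if not read.is_unmapped)
    print("mapped reads in BAM:", mapped)

    with pysam.AlignmentFile("aln.cram", "rc", reference_filename="ref.fa") as cram:
        mapped = sum(1 for read in cram if not read.is_unmapped)
    print("mapped reads in CRAM:", mapped)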

4

I am going to paste the contents of https://sourceforge.net/p/samtools/mailman/message/28989211/ below, as the SourceForge formatting (lack of wrapping) makes it difficult to read. Plus, it is quite relevant to the current discussion.

-------------------------- content of https://sourceforge.net/p/samtools/mailman/message/28989211/ ----------------------

Hello Ewan et al,

Recently I have run into several questions related to formats and compression, so I decided to do some experiments, as I found I did not know much myself.

I am looking at the first 30 million reads from SRR065390, which is the whole-genome resequence data from C. elegans. I chose C. elegans, instead of a human chromosome, to see the effect of unmapped reads. I aligned the paired-end reads with BWA and only retained the following tags: RG, AM, SM, NM and MD. The BAM file size is about 1.7GB.

I tried the following formats/programs: samtools, picard (1.63; java 1.6.0_25), BioHDF (0.4a), Goby (1.9.8.3.1), cramtools (~0.7, cloned a couple of days ago), gzip, bzip2 and pbzip2. I ran picard mainly to figure out how much time cramtools spends on BAM I/O. I am running them on an old Linux server; probably I'd better try a more recent machine in case of a non-typical sw/hw configuration. Anyway, here are some crude numbers:

Id   cpuTime   realTime  outputSize   BriefCommandLine
     (sec)     (sec)     (bytes)
=========================================================================================
(1)  582       599       1766193610   samtools view -bS  # SAM=>BAM
(2)  546       576       1766193610   samtools view -b   # BAM=>BAM
(3)  39        39        N/A          samtools index     # indexing BAM (reading through)
(4)  497       600       1814906779   SamFormatConverter.jar I=aln.bam SILENT # BAM=>BAM
(5)  1001      1052      1003208626   cram --all-{qual,tags} --unmapped # BAM=>CRAM
(6)  2132      2088      1642777948   cramtools.jar bam  # CRAM=>BAM

(7)  5793      8711      2814984166   bioh5g_import_alignments # SAM=>BioHDF; failed
(8)  2134      1987      317646256    goby.jar -m stc -p threshold=1 # SAM=>Goby

(9)  597       620       1644333705   gzip -c   # SAM=>SAM.gz
(10) 70        89        N/A          gzip -dc  # SAM.gz=>SAM
(11) 2065      2076      1376445754   bzip2 -c  # SAM=>SAM.bz2
(12) 372       373       N/A          bzip2 -dc # SAM.bz2=>SAM
(13) 2059      256       1378699551   pbzip2 -cp8  # SAM=>SAM.bz2, 8 threads
(14) 413       50        N/A          pbzip2 -dcp8 # SAM.bz2=>SAM, 8 threads

(15) 164       165       N/A          samtools view  # BAM=>SAM, 2.6GB exome BAM
(16) 55        56        N/A          samtools index # indexing, 2.6GB exome BAM
(17) 1517      1540      N/A          sam-dump -u    # cSRA=>SAM, 1.8GB exome cSRA

(18) 140       141       N/A          samtools depth > /dev/null  # per-base depth
(19) 4903      4815      N/A          mpileup -C50 -Euf|bcftools  # SNP+INDEL calling

My interpretation of the results:

a) Cramtools first. I am running cramtools in a largely lossless mode, but with most read names converted to integers; this is why the BAM converted back from CRAM is smaller [(6) vs. (1)]. The CRAM format is superior to BAM in terms of file size. It is 40% smaller on disk [=(1.64-1.00)/1.64, ref. (5) and (6)] in the lossless mode and will be much smaller in the lossy mode. The BAM=>CRAM conversion is very fast, only twice as slow as a BAM=>BAM conversion [(4) vs. (5)]. However, my major concern is the slow CRAM=>BAM conversion. Cramtools takes about 1635 sec [=2132-497, ref. (4) and (6)] to reconstruct the alignment information, excluding the time spent on BAM writing, while samtools takes only 39 sec [ref. (3)] to read through the entire BAM. Is this a good way to measure the speed of CRAM decoding? Am I using cramtools in the best way, and is my result typical? If the timing is about right, is it possible to substantially optimize the CRAM=>BAM conversion? Although samtools SNP calling is slower than the CRAM=>BAM conversion [ref. (19)], simple tasks such as computing depth and collecting basic statistics on a BAM will be far slower with CRAM [ref. (18)], which is worrying. In addition, the most time-consuming step in calling is BAQ, which may be replaced by better strategies and be naturally deprecated by longer reads in the future.

b) Then cSRA. The BAM<=>cSRA conversion is now available in the SRAtoolkit. The documentation is sparse at the moment. I do not know how to do the BAM=>cSRA conversion. So I downloaded existing human exome files here:

ftp://ftp.ncbi.nih.gov/1000genomes/ftp/phase1/technical/ncbi_varpipe_data/compression_analysis/DATA/FIN/HG00311.mapped.illumina.mosaik.FIN.exome.20110411/no-quantized.{csra,bam}

to evaluate the decoding speed. cSRA is also superior to BAM in terms of file size (1.8G vs 2.6G), but is also much slower for cSRA=>SAM conversion [ref. (15) and (17)].

c) BioHDF reports errors at the end of conversion. The (possibly incomplete) output is already larger than the BAM. Nonetheless, BioHDF is editable, while BAM is static.

d) The Goby format. The Goby output is much smaller than the others, but from the Goby=>SAM output it seems that Goby discards base qualities, read names together with pairing, optional tags and also CIGARs. Thus Goby may be an efficient intermediate format for variant calling, but it does not serve the purpose of archiving. In addition, when doing the Goby=>SAM conversion, only a small fraction of alignments are output; I do not know why. The generated Goby file is also rejected by IGV.

e) Compression algorithms. There have been quite a few threads on the compression algorithms. There is nothing new here in my table: gzip is much faster than bzip2, which is why we chose zlib instead of bzlib to compress BAM.

I am mostly worried that I may have used some of the programs in the table incorrectly. I apologize if there are major errors and hope others can correct me.

If the table is about right, my summary is that the advantage of BAM is its relatively simple structure and fast decoding. Both cSRA and Cram have the advantage of much smaller file sizes, which will be significant when lossy compression is enabled. At present, Cram seems more mature than cSRA. Goby and BioHDF have unique features of their own, but they might not be ready for general production uses.

Looking forward, I think BAM ultimately needs a revamp, given the increasing read lengths and the accumulating flaws in the initial design. CRAM right now seems the best candidate as a replacement for BAM. On the other hand, my opinion is that the first priority of CRAM development should be to optimize decoding speed. The current decoding speed may be a concern for a large project like 1000g.

I have not decided what the best co-development model between cramtools and samtools is. This needs more thought.

regards,

Heng

On Mar 8, 2012, at 12:43 PM, Ewan Birney wrote:

I am not sure that everyone here is on the cramtools mailing list, so just to say that CRAM 0.7 is out:

http://www.ebi.ac.uk/ena/about/cram_toolkit

CRAM is a reference-based compression toolkit, which can shrink the size of BAM files from 2-fold up to 50-fold, depending on the amount of tag information retained and the extent to which lossy compression is used.

CRAM 0.7 has the following functionality:

  • It always has lossless storage of bases and read pair information, including long-range read pairs. It can store non-reference-based reads, but without significant compression gains (as this is close to optimum).

  • It stores read group and mapping quality information per read

  • It can store either lossless quality data or, if desired, lossy quality-based compression.

  • If a lower-entropy quality distribution is presented, it automatically compresses this better (i.e., if quality score binning is used, it will take less space).

  • For lossy modes beyond binning, there is a variety of pre-packed lossy quality modes, such as storing only the qualities at positions which differ from the reference, up to the ability to provide totally arbitrary per-read-position quality storage, which gives the user complete control over the loss of precision.

  • It stores all the information in the BAM header during a bam2cram conversion.

  • It supports arbitrary tags present in BAM. The support of arbitrary tags in CRAM does not imply that a CRAM based archive will store those tags - we would take "archive tags", currently those specified in the archive BAM specification.

  • It supports random access indexing to provide slices of the data, like BAM.

  • The cramtools library both acts as a command line tool (like picard/samtools) and it complies to most of the Picard read I/O interfaces, meaning that Java tools which have been written using Picard can have CRAM backends provided. We are working with the Picard developers to aim for even better plug in behaviour.

In short, at this point we believe that CRAM is "feature complete". We know a variety of people have tried CRAM (many thanks for those who have) and that has extensively influenced our development path. We are sure there are still some bugs to be found and squashed, but at this stage it should be in the position to be extensively tested, with no major show stopping bugs.

We would like to aim to have CRAM an eventual replacement for BAM for archive submission and display, and we are also working with our other archive partners (NCBI) to ensure that the CRAM representation of both lossless and lossy compression can be round tripped with the NCBI compression scheme (cSRA).

Kudos to Vadim, Rasko and Guy for putting this together, many thanks to the early adopters/testers (james B, Klaus, Tony C, others), and please do email the cram-dev list for more information.

http://listserver.ebi.ac.uk/mailman/listinfo/cram-dev

Ewan

8

This post is quite old, so I need to give an update. Firstly, the C implementation of CRAM is only slightly slower than samtools on decompression, but slightly faster on compression; the speed of decoding CRAM is not a concern any more. I don't know about the Java implementation. Secondly, Goby 2.0, released after this post, supports lossless compression. It compresses better but is several times slower on compression/decompression.

0

Well said; plus, of course, the old "historical inertia" factor, as Pierre points out in his answer. There is a lot to be said for sheer momentum.

0

Any JavaScript implementations of CRAM readers yet?

9 • 10.8 years ago

Neither of the solutions you have is a full alternative: HDF and CRAM are generic binary formats that could store any data; neither is a specification (nor is it obvious that a SAM file transformed into HDF would actually be more performant than the BAM format). The reduced BAM format is not a generic format either; it is a simplified format that targets one specific use case.

4

I think the CRAM referred to by the OP is sequencing-data specific and designed to use less space than BAM. The spec is at https://samtools.github.io/hts-specs/CRAMv2.1.pdf though 3.0 is expected soon.

0

Doesn't CRAM do away with read names? While that could be done with BAMs, it would normally seem to make CRAM somewhat lossy in comparison.

3

CRAM can be produced in a lossless manner, keeping read names and existing quality values.
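
As a sketch of what that lossless round trip can look like with pysam (file names are placeholders; the reference FASTA must be the one the reads were aligned to):

    import pysam

    # Lossless BAM -> CRAM conversion: every record, including read names,
    # quality strings and optional tags, is copied unchanged. Paths are placeholders.
    with pysam.AlignmentFile("aln.bam", "rb") as bam_in, \
         pysam.AlignmentFile("aln.cram", "wc", template=bam_in,
                             reference_filename="ref.fa") as cram_out:
        for read in bam_in:
            cram_out.write(read)

The command-line equivalent is samtools view -C -T ref.fa -o aln.cram aln.bam.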

0

There is a specific format and API (BioHDF) to store alignments in HDF, and there is a method to import a SAM file or stream.

http://www.hdfgroup.org/projects/biohdf/api_docs/html/group__bioh5g__alignments.html

The documentation is from December 2011, so I am not sure if the project is still being developed.

0

BioHDF is currently not in development. See my response to the original question.

8 • 10.8 years ago

"Why is this?"

Because most tools (samtools 0.1.19), workflows, etc. still use BAM.

You never change a winning team.

HDF5: I don't think it's the best tool to store this data (you want streaming, you want to avoid padding when a record is too small, you want bgzip compression, ...). On a personal note, I think HDF5 is a nightmare to program in C. See: Using Hdf5 To Store Bio-Data
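
To illustrate the streaming point: because SAM/BAM can be read from and written to pipes, a filter written in a scripting language drops straight into a Unix pipeline, which is much harder to arrange with an HDF5 container. A minimal pysam sketch (the mapping-quality cutoff and the file names in the usage comment are placeholders):

    import pysam

    # Read SAM from stdin ("-"), write SAM to stdout, keeping only reads with
    # mapping quality >= 30. Example usage:
    #   samtools view -h aln.bam | python filter.py > filtered.sam
    with pysam.AlignmentFile("-", "r") as sam_in, \
         pysam.AlignmentFile("-", "w", template=sam_in) as sam_out:
        for read in sam_in:
            if read.mapping_quality >= 30:
                sam_out.write(read)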

Oh, and why don't people use XML or JSON instead of VCF? Why don't people all use the same best programming language (which is Java*)? :-)

(*) troll.

1

Still on HDF5: I agree it is not so easy to program in C, but it still offers many good points, including platform independence.

The coolest part, I think, is that you can embed many data sources in a single file without having to define yet another custom format, with the tedious task of implementing a dedicated parser (XML is not an option for big data). Having a single file instead of a dozen files greatly eases the end user's life. We took the HDF5 approach in the GATB project, where we have to save de Bruijn graph information in a single file.

Another point I forgot to mention: HDF5 provides several tools (like h5dump) that allow an ASCII view of the (binary) content of an HDF5 file; it is then possible to use classic command-line tools (like awk) to do some post-processing.

0

I've used the HDF Java API (which sits on top of the C API) to process HDF files without much of a problem.

7 • 10.6 years ago

As one of the people on the BioHDF project, I should probably chime in here.

For those who are unfamiliar, the simplest explanation of HDF5 is that it essentially allows you to store multidimensional arrays ('datasets') of data in a regular file. The arrays can be of one of our pre-determined types (which, unsurprisingly, closely match C types) or of a user-created type which can be stored in the file. These arrays can be organized using a filesystem-like structure of what we call 'groups'. You can also annotate these groups, stored types, and datasets with what we call 'attributes', which are just small data elements (ints, strings, etc.). Everything is stored in a binary format which we publish. Nothing is stopping you from writing your own I/O library based on said format, but most people use our C library, either alone or wrapped in their favorite language, for access. I've occasionally heard HDF5 described as a 'binary file format construction kit', since the semantics of the groups, datasets, etc. are up to the user.
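
To make that data model concrete, here is a minimal h5py sketch of groups, datasets and attributes. The layout and names are invented purely for illustration; they are not the BioHDF schema.

    import numpy as np
    import h5py

    # A toy layout: one group per sample, compressed datasets for positions and
    # fixed-width sequences, and attributes for small bits of metadata.
    with h5py.File("example.h5", "w") as f:
        grp = f.create_group("/alignments/sample_01")
        grp.attrs["aligner"] = "bwa"
        grp.create_dataset("pos", data=np.array([100, 250, 512], dtype=np.int64),
                           compression="gzip")
        grp.create_dataset("seq", data=np.array([b"ACGT", b"GGCA", b"TTAG"], dtype="S4"),
                           compression="gzip")

    # Reading back: datasets behave like NumPy arrays.
    with h5py.File("example.h5", "r") as f:
        print(dict(f["/alignments/sample_01"].attrs))
        print(f["/alignments/sample_01/pos"][:])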

The main reason that we stopped development on BioHDF was that the project ended and the ecosystem had already settled on BAM. My company (The HDF Group) is not a research lab and we are not biologists, so it would be difficult for us to push the project forward. The BioHDF code that exists is not completely finished and was basically a tech demo, so performance is not stellar. That said, HDF5 could conceivably be a useful storage medium for NGS data, with some caveats.

There would definitely be some benefits to HDF5 as an NGS data container:

  • HDF5 supports MPI-IO, so it would be easier to write parallel programs.
  • HDF5 has a built-in cache, which can make I/O more performant (depends on I/O pattern).
  • HDF5 is in wide use so existing data analysis tools (Matlab, etc.) could read the files.
  • HDF5 almost certainly scales better than any flat format.
  • HDF5 is supported by a company, so there's a help desk you can call/email, etc.
  • HDF5 supports flexible and heterogeneous data storage. For example, you could have a core set of datasets and groups that make up a 'schema' (HDF5 doesn't support formal schemas at this time) and individual vendors and labs could add their own data objects without disturbing queries based on the core. You could do this without coordinating with a central authority - any extra data would just be ignored by a reader that only understood the core schema. This also means that you could do nice things like creating your own indexes and storing them in the file.
  • HDF5 is probably going to be around for a long time. We spend a LOT of time making sure our file format and tools are both backward- and (to a certain extent) forward-compatible. NASA is a huge supporter of us and they keep data forever.
  • The compression scheme is flexible. We support compression plugins via an API and you can use different compression on different datasets.

Some downsides of HDF5 as an NGS storage format (aside from the ecosystem thing):

  • Variable-length string storage is not compressed. Regular arrays of strings are compressed, but VL strings are stored as pointers into a different file structure that is not compressed in the current version of the library. For NGS reads that have a fixed length, this is not a problem. Unfortunately, for other types of string data that are not so regular, you either have to over-specify the string length or store the strings in a concatenated 1D dataset if you want compression (this is why the default format in BioHDF is so slow, btw). This is a fixable problem, however; we just haven't had the resources (time, $$$) to fix it. See the sketch after this list for an illustration.
  • The C API is really close to the metal and has a steep learning curve. We do have a high-level API that's easier to use, as well as Java and Python bindings (h5py), though, and I would expect that an HDF5-backed NGS storage scheme would have its own, easier-to-use, API.
  • Subsets of BAM files can be obtained using ftp's ability to download n bytes of data at a particular file offset, which is very helpful for NGS data browsers. You can't do this with HDF5 due to the file format's complexity. We do have an HDF5 server in the works, but that project will need some more development time before the server is robust enough for public Internet use.
  • I get the impression that a lot of pipelines stream BAM files to SAM and then parse the output. If you are simply moving a linear sequence of bytes from point A to point B many of the features of the HDF5 file format and library are just overhead.
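
Picking up the first bullet above, here is a sketch of the difference: storing reads of a known length as a fixed-width byte-string dataset lets the compression filter see the character data, whereas a variable-length string dataset keeps its characters in an uncompressed heap. Dataset and file names are invented.

    import numpy as np
    import h5py

    reads = [b"ACGT" * 9] * 10000  # 36 bp reads, all the same length

    with h5py.File("reads.h5", "w") as f:
        # Fixed-width byte strings: the raw characters pass through the gzip filter.
        f.create_dataset("reads_fixed", data=np.array(reads, dtype="S36"),
                         compression="gzip")
        # Variable-length strings: the dataset holds heap references, so the
        # character data itself is not compressed (the limitation described above).
        f.create_dataset("reads_vlen", data=np.array(reads, dtype=object),
                         dtype=h5py.string_dtype(encoding="ascii"),
                         compression="gzip")

Inspecting the file with h5dump -p should show a real compression ratio for reads_fixed and essentially none for reads_vlen.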

So that's my two cents on using HDF5 as an NGS format. It's certainly doable and we'd love to work with anyone who has the clout to push such a thing forward (info@hdfgroup.org). Breaking into the BAM ecosystem would be tough, but it could be done with enough outreach and resources, I think.

One last thing: I'd be remiss if I didn't mention SeqDB here. It's an HDF5-backed replacement for FASTQ files and shows that you can use HDF5 for genomic data with good performance.

http://www.ncbi.nlm.nih.gov/pubmed/23702558

https://bitbucket.org/mhowison/seqdb

Actually, there's even one more thing I should mention - Maintainers of binary formats occasionally switch to using us as the underlying storage format when they get sick of worrying about things like platform-independence and scalability. netCDF is an example of this. They switched to using HDF5 in version 4 of the format. I asked Heng about this once, but he said that (at the time) most pipelines that he knew of simply streamed BAM files to SAM and didn't use the C API, making such a thing less useful.

1

Man, this post makes me so sad. :(

HDF5 is such an awesome format for data that is constantly used. Why you guys chose to focus on replacing BAM, the archive format for genomic data, rather than the bigWig format or a whole new multidimensional, parallel-I/O, compressed yet random-access format for storing downstream results, I'll never know.

HDF5 and other generic file formats will never be able to compete in the archive space. The more you tune your archive format, and the more crazily complex it becomes as a result, the more compactly it can store data (and the slower it becomes to read/write). But that's fine for an archive format - and that's why BAM and ultimately CRAM will always trump everything else.

But if BioHDF had focused on storing processed data, we wouldn't now be stuck with the BigWig/BedGraph formats for our downstream analysis. I wouldn't have to deal with god-awfully slow Python bindings for BigWig. I wouldn't be doing analysis on a cluster using a format designed for web browsers to access. How often do we do high-throughput analysis in a web browser? Even a simple gzipped struct is both more performant and smaller than BigWig, yet here we find ourselves - with HDF5's RESTful server nowhere to be seen, all because data utopia flowered in the middle of winter.

2

BAM is simpler, smaller and faster to process than HDF5. CRAM is complex, but it is as fast as BAM and even smaller. A well-designed specialized binary format often outperforms HDF5 all around. Also note that PacBio, which has been storing signals and base calls in HDF5 for several years, is switching to BAM. This is welcomed by many PacBio users, even at the great cost of refactoring pipelines.

I rarely use BigWig, but IIRC, no other lossless formats are as capable as BigWig. They may be smaller and faster to process, but they lack indexing and remote access. They can't replace BigWig.

I guess a potential issue with BigWig is that it was not spec'd and did not have a standalone implementation until very recently. If it had started out the way BAM did, it would have been cleaner.

1

There are non-slow bindings for bigWig files :)

But yeah, the actual bigWig format is pretty crazy given what it's used for.
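
For reference, the kind of binding presumably meant here is something like pyBigWig, a thin Python wrapper over a C library. A minimal sketch; the file name, chromosome and coordinates are placeholders:

    import pyBigWig

    # Open a bigWig file and query chromosome sizes, summary statistics and
    # per-base values for a region. All names and coordinates are placeholders.
    bw = pyBigWig.open("coverage.bw")
    print(bw.chroms())                                  # {chrom: length, ...}
    print(bw.stats("chr1", 0, 1000000, type="mean"))    # mean signal over the region
    print(bw.values("chr1", 10000, 10010))              # per-base values
    bw.close()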

2 • 10.6 years ago • Pavel Senin ★ 1.9k

Personally, I think that people use whatever works for their task (i.e. whatever is effective and efficient); I am with Pierre here. To summarize: you may own all the bells and whistles, but you still have to make them work and prove them better. L. Torvalds' well-known response about "better specs" versus practical functionality applies here (http://blog.codinghorror.com/dysfunctional-specifications/): "...When reality and specs clash, the spec has zero meaning. Zilch. Nada. None...."
