Why are we still using BAM files? And not CRAM, HDF5 or improved BAM files?
5 · 10.4 years ago · WilliamS ▴ 320

The BAM format was developed at a time when I/O (disk / network) was not a limiting factor for genomics data analysis. Currently it is, especially with pipelines like GATK that process BAM files in multiple steps and read and write BAM files for every step (instead of streaming in memory).

I know that raw sequencing data (SOLiD & PacBio) and other large (scientific, financial, etc.) datasets are stored in HDF5, and there are initiatives like CRAM and ReducedBAM.

Still, I believe most organizations are using regular BAM files. Why is this? Is there a roadmap towards the development (and adoption!) of improved formats for storing DNA alignment data?
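
(For what it's worth, streaming through pipes is already possible with htslib-based tools. Below is a minimal sketch of what one such pipeline step could look like, assuming pysam is available; the filter criterion and file names are purely illustrative and not how GATK itself works.)

```python
# filter_step.py - a hypothetical streaming step: read BAM from stdin, write BAM
# to stdout, so several steps can be chained with Unix pipes instead of writing
# an intermediate BAM to disk after every step.
import pysam

with pysam.AlignmentFile("-", "rb") as bam_in, \
     pysam.AlignmentFile("-", "wb", template=bam_in) as bam_out:
    for read in bam_in:
        # keep mapped reads with a minimal mapping quality (example criterion)
        if not read.is_unmapped and read.mapping_quality >= 20:
            bam_out.write(read)
```

Such a step would sit inside a single pipe, e.g. `samtools view -u aln.bam | python filter_step.py | samtools sort -o filtered.sorted.bam -`, with nothing written to disk in between.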

cram bam io hdf • 33k views
0

I would ask why organizations that are dealing with thousands of samples are still relying on individual files, rather than a warehouse where it is trivial to pull up all reads from a given region, for instance.

2

GATK developers would ask why we need such a warehouse when they have the single-sample calling pipeline. Illumina/CG people would ask why we need such a warehouse if they can make very good variant calls. I would ask why we need such a warehouse if we can derive good assemblies.

1

And those running labs would ask who is going to pay for such a warehouse.

10

And grandma would ask why she should put her money in a bank when she has a perfectly good mattress.

28 · 10.4 years ago · lh3 33k

Firstly, ReducedBAM is still BAM, and it has been discontinued. Secondly, not streaming BAM is a problem with GATK, not with BAM; a lot of other tools take advantage of streaming, which SAM/BAM encourages by design. Thirdly, I think CRAM will ultimately replace BAM. CRAM has all the advantages of BAM (except that it is more complex) plus smaller file sizes and other additional features. With the official SAM/VCF/tabix libraries, htslib in C and htsjdk in Java, seamlessly supporting both, users will be able to use CRAM wherever BAM is applicable.

IMO, the single biggest reason that other formats (Goby, cSRA, ADAM and BioHDF) did not fly is that they were a bit late to the party. They all have their own advantages, but when most data are already in the BAM format and the vast majority of downstream tools accept only BAM, there is not much room for a new format unless it is substantially better than BAM - not so easy ;-) CRAM wins out for now mainly because it is so close to BAM conceptually that it is easy for the two to share the same APIs.

Other minor reasons include: a) processing BAM was significantly faster, though CRAM is catching up now; b) BAM is relatively simple and was fully spec'd from the very beginning, so it has been implemented independently in a variety of programming languages (C, C++ (multiple times), Java, Lisp, Go, D, Haskell, JavaScript, ...), which greatly helps adoption; c) this may seem a weird reason, but SAM/BAM was first implemented in C with admittedly bad but still usable APIs. For a scripting language, binding to a C library is usually easier than binding to a library written in C++, Java, etc. Perl, Python and R bindings played an important role in the early adoption of BAM. After all, most analysts work with scripting languages.
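
To illustrate the "same APIs" point, here is a minimal sketch using pysam (Python bindings over htslib); the file names and the reference FASTA are placeholders:

```python
# The same htslib-backed call reads either container; CRAM additionally needs
# the reference FASTA the reads were aligned against.
import pysam

def count_mapped(path, reference=None):
    mode = "rc" if path.endswith(".cram") else "rb"
    with pysam.AlignmentFile(path, mode, reference_filename=reference) as aln:
        return sum(1 for read in aln if not read.is_unmapped)

print(count_mapped("sample.bam"))
print(count_mapped("sample.cram", reference="ref.fa"))
```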

4

I am going to paste the contents of https://sourceforge.net/p/samtools/mailman/message/28989211/ below, as the SourceForge formatting (lack of wrapping) makes it difficult to read. Plus, it is quite relevant to the current discussion.

-------------------------- content of https://sourceforge.net/p/samtools/mailman/message/28989211/ ----------------------

8

This post is quite old, so I need to give an update. Firstly, the C implementation of CRAM is only slightly slower than samtools on decompression but slightly faster on compression; the speed of decoding CRAM is not a concern any more. I don't know about the Java implementation. Secondly, Goby 2.0, released after this post, supports lossless compression. It compresses better but is several times slower on compression/decompression.

0

Well said. Plus, of course, there's the old "historical inertia" factor, as Pierre points out in his answer. There is a lot to be said for sheer momentum.

0

Any JavaScript implementations of CRAM readers yet?

9 · 10.4 years ago

Neither of the solutions you mention is a full alternative: HDF and CRAM are generic binary formats that could store any data. Neither is a specification (nor is it obvious that a SAM file transformed into HDF would actually be more performant than BAM). ReducedBAM, in contrast, is not a generic format; it is a simplified format that targets one specific use case.

4

I think the CRAM referred to by the OP is specific to sequencing data and designed to use less space than BAM. The spec is at https://samtools.github.io/hts-specs/CRAMv2.1.pdf, though 3.0 is expected soon.

0

Doesn't CRAM do away with read names? While that could also be done with BAM, it would seem to make CRAM somewhat lossy in comparison.

3

CRAM can be produced in a lossless manner, keeping read names and existing quality values.
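
For reference, the usual route is something like `samtools view -C -T ref.fa -o sample.cram sample.bam`; the sketch below does the same through pysam (file names are placeholders):

```python
# Hypothetical lossless BAM -> CRAM conversion: read names, base qualities and
# tags are carried over unchanged; the reference FASTA is needed because CRAM
# stores bases as differences against it.
import pysam

with pysam.AlignmentFile("sample.bam", "rb") as bam_in, \
     pysam.AlignmentFile("sample.cram", "wc", template=bam_in,
                         reference_filename="ref.fa") as cram_out:
    for read in bam_in:
        cram_out.write(read)
```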

0

There is a specific format and API (BioHDF) for storing alignments in HDF, and there is a method for importing a SAM file or stream.

http://www.hdfgroup.org/projects/biohdf/api_docs/html/group__bioh5g__alignments.html

The documentation is from December 2011, so I am not sure if the project is still being developed.

0

BioHDF is currently not in development. See my response to the original question.

8 · 10.4 years ago

Why is this?

Because most tools (samtools 0.1.19), workflows, etc. still use BAM.

You never change a winning team.

HDF5: I don't think it's the best tool to store this data (you want streaming, you want to avoid padding when a record is too small, you want bgzip compression, ...). On a personal note, I think HDF5 is a nightmare to program in C. See: Using Hdf5 To Store Bio-Data

Oh, and why don't people use XML or JSON instead of VCF? Why doesn't everyone use the same best programming language (which is Java*)? :-)

(*) troll.

1

Still on HDF5: I agree it is not so easy to program in C, but it still offers many good points, including platform independence.

The coolest part, I think, is that you can embed many data sources in a single file, without having to define yet another custom format and the tedious task of implementing a dedicated parser (XML is not an option for big data). Having a single file instead of a dozen files greatly eases the end user's life. We took the HDF5 approach in the GATB project, where we have to save de Bruijn graph information in a single file.

Another point I forgot to mention: HDF5 provides several tools (like h5dump) that give you an ASCII view of the (binary) content of an HDF5 file; it is then possible to use classical command-line tools (like awk) to do some post-processing.
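
As a hedged illustration of the single-file idea (using h5py; the dataset names are made up and are not GATB's actual layout):

```python
# Several heterogeneous data sources in one compressed HDF5 file; afterwards
# "h5dump -d /kmers/counts graph.h5" prints that dataset as text for quick
# inspection with awk and friends.
import h5py
import numpy as np

with h5py.File("graph.h5", "w") as f:
    f.create_dataset("kmers/counts",
                     data=np.random.randint(1, 100, size=10_000),
                     compression="gzip")
    f.create_dataset("kmers/sequences",
                     data=np.array([b"ACGTACGT"] * 10_000, dtype="S8"),
                     compression="gzip")
    f.attrs["k"] = 8  # small metadata stored as a file-level attribute
```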

0

I've used the HDF Java API (which sits on top of the C API) to process HDF files without much of a problem.

7 · 10.3 years ago

As one of the people on the BioHDF project, I should probably chime in here.

For those who are unfamiliar, the simplest explanation of HDF5 is that it essentially allows you to store multidimensional arrays ('datasets') of data in a regular file. The arrays can be of one of our pre-determined types (that, unsurprisingly, closely match C types) or a user-created type which can be stored in the file. These arrays can be organized using a filesystem-like structure of what we call 'groups'. You can also annotate these groups, stored types, and datasets with what we call 'attributes', which are just small data elements (ints, strings, etc.). Everything is stored in a binary format which we publish. Nothing is stopping you from writing your own I/O library based on said format, but most people use our C library, either alone or wrapped in their favorite language, for access. I've occasionally heard of HDF5 described as a 'binary file format construction kit' since the semantics of the groups, datasets, etc. is up to the user.
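
A short sketch of that data model in h5py (all names are purely illustrative):

```python
# HDF5 in a nutshell: groups organize objects like directories, datasets hold
# typed n-dimensional arrays, and attributes attach small metadata to either.
import h5py
import numpy as np

with h5py.File("example.h5", "w") as f:
    run = f.create_group("run_001")                    # a 'group'
    run.attrs["instrument"] = "made-up-sequencer"      # an 'attribute'
    quals = run.create_dataset(                        # a 2-D 'dataset'
        "qualities",
        data=np.zeros((1000, 100), dtype=np.uint8),
        compression="gzip")
    quals.attrs["description"] = "per-base quality scores"
```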

The main reason that we stopped development on BioHDF was that the project ended and the ecosystem had already settled on BAM. My company (The HDF Group) is not a research lab and we are not biologists, so it would be difficult for us to push the project forward. The BioHDF code that exists is not completely finished and was basically a tech demo, so performance is not stellar. That said, HDF5 could conceivably be a useful storage medium for NGS data, with some caveats.

There would definitely be some benefits to HDF5 as an NGS data container:

  • HDF5 supports MPI-IO, so it would be easier to write parallel programs.
  • HDF5 has a built-in cache, which can make I/O more performant (depending on the I/O pattern).
  • HDF5 is in wide use so existing data analysis tools (Matlab, etc.) could read the files.
  • HDF5 almost certainly scales better than any flat format.
  • HDF5 is supported by a company, so there's a help desk you can call/email, etc.
  • HDF5 supports flexible and heterogeneous data storage. For example, you could have a core set of datasets and groups that make up a 'schema' (HDF5 doesn't support formal schemas at this time) and individual vendors and labs could add their own data objects without disturbing queries based on the core. You could do this without coordinating with a central authority - any extra data would just be ignored by a reader that only understood the core schema. This also means that you could do nice things like creating your own indexes and storing them in the file.
  • HDF5 is probably going to be around for a long time. We spend a LOT of time making sure our file format and tools are both backward- and (to a certain extent) forward-compatible. NASA is a huge supporter of us and they keep data forever.
  • The compression scheme is flexible. We support compression plugins via an API and you can use different compression on different datasets.

Some downsides of HDF5 as an NGS storage format (aside from the ecosystem thing):

  • Variable-length string storage is not compressed. Regular arrays of strings are compressed, but VL strings are stored as pointers into a different file structure that is not compressed in the current version of the library. For NGS reads that all have a fixed length, this is not a problem. Unfortunately, for other kinds of string data that are not so regular, you either have to over-specify the string length or store the strings in a concatenated 1D dataset if you want compression (this is why the default format in BioHDF is so slow, btw; see the sketch after this list). This is a fixable problem, however. We just haven't had the resources (time, $$$) to fix this.
  • The C API is really close to the metal and has a steep learning curve. We do have a high-level API that's easier to use, as well as Java and Python bindings (h5py), though, and I would expect that an HDF5-backed NGS storage scheme would have its own, easier-to-use, API.
  • Subsets of BAM files can be obtained using ftp's ability to download n bytes of data at a particular file offset, which is very helpful for NGS data browsers. You can't do this with HDF5 due to the file format's complexity. We do have an HDF5 server in the works, but that project will need some more development time before the server is robust enough for public Internet use.
  • I get the impression that a lot of pipelines stream BAM files to SAM and then parse the output. If you are simply moving a linear sequence of bytes from point A to point B many of the features of the HDF5 file format and library are just overhead.
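
To make the variable-length string caveat concrete, here is a hedged h5py sketch (dataset names are illustrative): fixed-width string arrays go through the normal compression filter, while variable-length strings are stored via a separate heap that the filter does not touch.

```python
# Fixed-width byte strings compress as a normal chunked dataset; variable-length
# strings are stored as references into a heap that (as described above) the
# gzip filter does not compress, so the same setting buys much less.
import h5py
import numpy as np

reads = [b"ACGTACGTAC"] * 100_000
with h5py.File("strings.h5", "w") as f:
    f.create_dataset("fixed", data=np.array(reads, dtype="S10"),
                     compression="gzip")
    f.create_dataset("vlen", data=reads,
                     dtype=h5py.special_dtype(vlen=bytes),
                     compression="gzip")
```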

So that's my two cents on using HDF5 as an NGS format. It's certainly doable and we'd love to work with anyone who has the clout to push such a thing forward (info@hdfgroup.org). Breaking into the BAM ecosystem would be tough, but it could be done with enough outreach and resources, I think.

One last thing: I'd be remiss if I didn't mention SeqDB here. It's an HDF5-backed replacement for FASTQ files and shows that you can use HDF5 for genomic data with good performance.

http://www.ncbi.nlm.nih.gov/pubmed/23702558

https://bitbucket.org/mhowison/seqdb

Actually, there's even one more thing I should mention: maintainers of binary formats occasionally switch to using us as the underlying storage format when they get sick of worrying about things like platform independence and scalability. netCDF is an example of this; it switched to using HDF5 in version 4 of the format. I asked Heng about this once, but he said that (at the time) most pipelines he knew of simply streamed BAM files to SAM and didn't use the C API, making such a thing less useful.

1

Man, this post makes me so sad. :(

HDF5 is such an awesome format for data that is constantly used. Why you guys chose to focus on replacing BAM, the archive format for genomic data, rather than the bigWig format or a whole new multidimensional, parallel-I/O, compressed yet random-access format for storing downstream results, I'll never know.

HDF5 and other generic file formats will never be able to compete in the archive space. The more you tune your archive format, and the crazier and more complex it becomes as a result, the more compactly it can store data... (and the slower it becomes to read/write). But that's fine for an archive format - and that's why BAM, and ultimately CRAM, will always trump everything else.

But if BioHDF had focused on storing processed data, we wouldn't now be stuck with the BigWig/BedGraph formats for our downstream analysis. I wouldn't have to deal with god-awfully slow Python bindings for BigWig. I wouldn't be doing analysis on a cluster using a format designed for web browsers to access. How often do we do high-throughput analysis in a web browser? Even a simple gzipped struct is both more performant and smaller than BigWig, yet here we find ourselves - with HDF5's RESTful server nowhere to be seen, all because data utopia flowered in the middle of winter.

2

BAM is simpler, smaller and faster to process than HDF5. CRAM is complex, but it is as fast as BAM and even smaller. A well-designed specialized binary format often outperforms HDF5 all around. Also note that PacBio, which has been storing signals and base calls in HDF5 for several years, is switching to BAM. This is welcomed by many PacBio users, even at the great cost of refactoring their pipelines.

I rarely use BigWig, but IIRC, no other lossless formats are as capable as BigWig. They may be smaller and faster to process, but they lack indexing and remote access. They can't replace BigWig.

I guess a potential issue with BigWig is that it is not spec'd and did not have a standalone implementation until very recently. If it had started out the way BAM did, it would have been cleaner.
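
As a concrete (and hedged) illustration of the indexing and remote-access point, using pyBigWig as one example of bigWig bindings; the URL and region are placeholders:

```python
# The bigWig index makes region queries cheap, and the same calls work on a
# local path or a remote URL without downloading the whole file (remote access
# requires pyBigWig built with curl support).
import pyBigWig

bw = pyBigWig.open("https://example.org/coverage.bw")       # or a local path
print(bw.stats("chr1", 1_000_000, 1_010_000, type="mean"))  # regional summary
print(bw.values("chr1", 1_000_000, 1_000_010))              # per-base values
bw.close()
```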

1

There are non-slow bindings for bigWig files :)

But yeah, the actual bigWig format is pretty crazy given what it's used for.

2 · 10.3 years ago · Pavel Senin ★ 1.9k

Personally, I think people use whatever works for their task (i.e., whatever is effective and efficient); I am with Pierre here. To summarize: you may have all the bells and whistles, but you still have to make them work and prove they are better. Linus Torvalds' well-known response about "better specs" versus practical functionality applies here (http://blog.codinghorror.com/dysfunctional-specifications/): "...When reality and specs clash, the spec has zero meaning. Zilch. Nada. None...."

