Question

BAM to CRAM to BAM

3

Entering edit mode

7.5 years ago

ElUretsky ▴ 30

I would like to know, can someone post a working example of converting a BAM file to CRAM and BAM again?

I tried samtools, cramtools but each time, the BAM tags get mangled. Can someone show me how to get:

samtools view file1.bam | md5sum 
ed071dff8d60c74eefd1d14eb1f4bb21

then convert to CRAM, back to bam file2.bam such that:

samtools view file2.bam | md5sum 
ed071dff8d60c74eefd1d14eb1f4bb21

Sorry if this is trivial but I have spent a few hours on this.

bam cram • 5.9k views

ADD COMMENT • link updated 7.4 years ago by James Bonfield ▴ 170 • written 7.5 years ago by ElUretsky ▴ 30

0

Entering edit mode

Unfortunately, due to the number of different ways the same data can be stored in SAM/BAM/CRAM, checksums can not be used as a proxy for knowing what data is stored. Slight variations in compression would change the checksum without touching the data at all. SAM -> BAM and BAM -> SAM are not entirely lossless, and neither is BAM -> CRAM. I doubt it was even ever intended to be so.

Typically with user-definable properties you ensure these properties are sorted to prevent this problem (and others) from occurring, but bioinformatics is anything but typical ^_^

ADD REPLY • link 7.4 years ago by John 13k

6

Entering edit mode

7.4 years ago

James Bonfield ▴ 170

CRAM doesn't entirely know whether to generate MD and NM tags or not. It'll either always generate them or never generate them. The RG tag is also stored differently in CRAM and gets added on at the end, which causes it to change position. This is valid as the order is stated to be meaningless in SAM, but it makes it hard to validate.

It is possible, just about, to get RG/NM/MD working correctly, but it's a fudge. You can get the RG, MD and NM tags stored verbatim in the CRAM file and then ask for CRAM not to autogenerate them. I don't think we have a way to do this in samtools though and it's a bit of a contrived example too. If you really want to try that though (I don't know what the difference you are seeing is - an example would help!) then try "scramble -P" (scramble from Staden io_lib), and either "scramble" or "samtools view --input-fmt-options decode_md=0" to decode without the MD/NM tags.

There are also scramble -q (don't add @PG header lines) and -p (preserve BAM i vs s vs c (etc) aux size) options, which may help too, but you wouldn't be able to notice these differences in your view | md5sum above.

Unfortunately there are other issues too that CRAM lacks perfect round-trip information on. It always fills out the pnext/rnext/tlen fields when it can and can also in some situations fix invalid information in the flags, much like samtools fixmates would. It's arguable whether that is a bug or not!

Finally, if you're just trying to validate the round-trip, perhaps it is better to use a format aware comparison function instead of md5sum so that changes that are valid within the specification (reordering) don't show up as a problem. Htslib tests come with a compare_sam.pl script for this purpose. It may also help to canonicalise your input first - eg running samtools fixmates to ensure it doesn't contain invalid values.

ADD COMMENT • link 7.4 years ago by James Bonfield ▴ 170

score 3 · Accepted Answer · 2017-07-11

3

Entering edit mode

7.5 years ago

Istvan Albert 102k

I think the source of the problem is that the BAM specification does not require or prescribe an order to the tags, hence these can be reordered and still represent the same information. I would contact the SAM developers and see what could be done about this:

https://github.com/samtools/

ADD COMMENT • link 7.5 years ago by Istvan Albert 102k