The SAM specification states the following:
BGZF is block compression implemented on top of the standard gzip file format....a compliant gunzip utility can decompress a BGZF compressed file
I would like be able to read a BAM file using gzip, as the specification says I should be able to. However, when I open a BAM file using the inbuilt python library gzip I am able to read the header normally but the reads still look compressed in some manner, like so:
AMCBCZ0XDZ100SMiASi??'"?Icd(ZHS2000-943_70:2:2301:14923:121232@("("("("("("("("("("("("("("("("(""%%##%'&##'((&''($'((($'($&%&'%%&&'(("'#%#&'&&$&(&(( !%!ASi?'!" !!!!""BCZ0XDZ100SMiIAd?sHS2000-905_68:3:1102:3166:478590?B ?"?"?"?"?"?"?"?"?"("("("("("("("!"%%##%''''#''!'(&'%'((%$&&(('('"'!'&(&!'&"# %&#&&$"!! "BCZ0XDZ51^1$30^CCCT$18SMiASi?'"I?d'HS2000-943_70:2:1302:17909:144833"("("?"?"?"?"?"?"?"("(!"!"?$!(!!"!"!"%%# #'#%!#& #%$###%%"!'(#'"!H0CH1CH2CBCZ0)'"IYd????????HS2000-943_70:2:1302:17909:144833@($""(!?"?"$?"?"?"??""("("("("("("("!
I can see the making of a real read in there, (i.e the read name: HS2000-943_70:2:2301:14923:121232
) but most of it is illegible. Perhaps I should be reading in entire blocks and then decompressing further? I am not sure. I have also tried reading the file using the bgzf biopython module and receive the same results.
I am aware of the BioStars question here. However, I feel given the explicit declaration in the SAM file specification that the files are readable by gzip, the answers (to use pysam) was inadequate.
Thanks for reading!
Thank you. This makes sense.
Do you have any idea how I would go about doing this? I would quite like to do this as a learning exercise.
I assume that the samtools must do this somewhere so I suppose I'll start by inspecting the relevant part of samtools.
You'll definitely want to have a look at the samtools C code (use the version from sourceforge rather than github, since the former is more self contained). You'll end up making a bunch of custom structures that will parallel those used in the C code (and pysam) and then using f.read(some_number) to fill in the various components.
It's good that you're only doing this as an exercise, since it'd be quicker to just rewrite pysam :)