Make bam index human readable
3
2
Entering edit mode
8.9 years ago

Hi All,

Is it possible to convert a bam index (.bai) to human readable format? I'm asking mostly for learning, I don't plan to do anything special with it.

There is a good description of the bam index in the sam spec document, session 5.2 but I would like to see a "real" example of an index in plain text.

Thanks

Dario

bam index bai • 5.7k views
ADD COMMENT
1
Entering edit mode

Are we excluding just opening one in a hex editor? :P

ADD REPLY
0
Entering edit mode

Like hexdump aln.bam.bai? No thanks!

ADD REPLY
0
Entering edit mode

Working on getting Integrated Genome Browser to display BAI files as genome graphs. Any advice or commentary you have time to contribute would be appreciated.

ADD REPLY
3
Entering edit mode
8.9 years ago

You can write such a tool yourself using R (just as an example):

> x = file('/Users/sdavis2/Downloads/vcfanno_0.0.7_darwin_386/example/ex.bam.bai')
> open(x,'rb')
> readChar(x,4) # Magic number
[1] "BAI\001"
> readBin(x,integer(),size=1,signed=FALSE) # Number of Ref Sequences
[1] 86

At this point, you can use loops to loop over each of the blocks, reading in data as described in the spec.

ADD COMMENT
0
Entering edit mode

Ahhh, I see. So I can read the binary and use the spec to correctly "interpret" the bytes I'm reading. I'll give it a try... Thank you.

ADD REPLY
0
Entering edit mode

It is a bit tedious to do the first time, but if you wrap this up as code and put things into R data structures as you go, it could up being both instructive and useful.

ADD REPLY
0
Entering edit mode

I ran this code a few weeks ago as an amusement and it worked. Now, my experiment actually requires to have a look at the BAM (and/or possibly BAI file) in the binary form. However, now, it keeps giving me an error 'object x not found'.

Is the same with BAM files when you are trying to make them human readable? And, also, can you please give me some more hints on how can I make my BAM human readable as I just started using programming languages. Thanks.

ADD REPLY
6
Entering edit mode
8.9 years ago

I wrote a tool to dump a BAI as XML using htsjdk : https://github.com/lindenb/jvarkit/wiki/Biostar172515

find DIR -name "*.bam" | xargs java -jar dist/biostar172515.jar  | xmllint --format -

<?xml version="1.0" encoding="UTF-8"?>
<bai-list>
<bam bam="DIR/exampleBAM.bam" has-index="true" n_ref="1">
    <reference ref-id="0" ref-name="chr1" ref-length="100000" n_aligned="33" n_bin="12" n_no_coor="0">
      <bin first-locus="1" last-locus="536870912" level="0" first-offset="0" n_chunk="0"/>
      <bin first-locus="1" last-locus="67108864" level="1" first-offset="0" n_chunk="0"/>
      <bin first-locus="1" last-locus="8388608" level="2" first-offset="0" n_chunk="0"/>
      <bin first-locus="1" last-locus="1048576" level="3" first-offset="0" n_chunk="0"/>
      <bin first-locus="1" last-locus="131072" level="4" first-offset="0" n_chunk="0"/>
      <bin first-locus="1" last-locus="16384" level="5" first-offset="828" n_chunk="1">
        <chunk chunk_beg="828" chunk_end="1963"/>
      </bin>
      <bin first-locus="16385" last-locus="32768" level="5" first-offset="1963" n_chunk="1">
        <chunk chunk_beg="1963" chunk_end="3323"/>
      </bin>
      <bin first-locus="32769" last-locus="49152" level="5" first-offset="3323" n_chunk="1">
        <chunk chunk_beg="3323" chunk_end="4687"/>
      </bin>
      <bin first-locus="49153" last-locus="65536" level="5" first-offset="4687" n_chunk="1">
        <chunk chunk_beg="4687" chunk_end="6501"/>
      </bin>
      <bin first-locus="65537" last-locus="81920" level="5" first-offset="0" n_chunk="0"/>
      <bin first-locus="81921" last-locus="98304" level="5" first-offset="6501" n_chunk="1">
        <chunk chunk_beg="6501" chunk_end="238223360"/>
      </bin>
      <bin first-locus="98305" last-locus="114688" level="5" first-offset="0" n_chunk="0"/>
    </reference>
  </bam>
  <bam bam="/DIR/exampleBAM2.bam" has-index="true" n_ref="1">
    <reference ref-id="0" ref-name="chr1" ref-length="100000" n_aligned="33" n_bin="12" n_no_coor="0">
      <bin first-locus="1" last-locus="536870912" level="0" first-offset="0" n_chunk="0"/>
      <bin first-locus="1" last-locus="67108864" level="1" first-offset="0" n_chunk="0"/>
      <bin first-locus="1" last-locus="8388608" level="2" first-offset="0" n_chunk="0"/>
      <bin first-locus="1" last-locus="1048576" level="3" first-offset="0" n_chunk="0"/>
      <bin first-locus="1" last-locus="131072" level="4" first-offset="0" n_chunk="0"/>
      <bin first-locus="1" last-locus="16384" level="5" first-offset="828" n_chunk="1">
        <chunk chunk_beg="828" chunk_end="1963"/>
(...)
      <bin first-locus="65537" last-locus="81920" level="5" first-offset="0" n_chunk="0"/>
      <bin first-locus="81921" last-locus="98304" level="5" first-offset="6501" n_chunk="1">
        <chunk chunk_beg="6501" chunk_end="238223360"/>
      </bin>
      <bin first-locus="98305" last-locus="114688" level="5" first-offset="0" n_chunk="0"/>
    </reference>
  </bam>
</bai-list>

see also hts_idx_load_core in htslib: https://github.com/samtools/htslib/blob/develop/hts.c

ADD COMMENT
0
Entering edit mode

Thank you Pierre!

ADD REPLY
1
Entering edit mode
3.9 years ago
baye.james ▴ 10

I had a go at this in Python. It's part of a repo I'm working on.

import struct
from collections import namedtuple

Idx = namedtuple("idx", ["magic", "n_ref", "refs", "n_no_coor"])
Ref = namedtuple("ref", ["n_bin", "bins", "n_intv", "intvs"])
Bin = namedtuple("bin", ["i_bin", "n_chunk", "chunks"])
Chunk = namedtuple("chunk", ["chunk_beg", "chunk_end"])
Intv = namedtuple("interval", ["ioffset"])

def read_bai(baifilepath):
    """Read BAI index file.

    Per-reference metadata (Magic bin number = 37450) is ignored.
    See https://samtools.github.io/hts-specs/SAMv1.pdf section 5.2.

    Args:
        baifilepath (str): Path to BAI file.
    Returns:
        A Idx namedtuple containing the parsed BAI data.
    """
    METADATA_IBIN = 37450 # pseudo-bin containing per-reference metadata. ignored.
    with open(baifilepath, "rb") as f:
        magic, n_ref = struct.unpack("4s i", f.read(4 + 4))
        refs = []
        for _ in range(n_ref):
            n_bin, = struct.unpack("i", f.read(4))
            bins = []
            for _ in range(n_bin):
                i_bin, n_chunk = struct.unpack("I i", f.read(4 + 4))
                chunks = []
                for _ in range(n_chunk):
                    chunk_beg, chunk_end = struct.unpack("Q Q", f.read(8 + 8))
                    chunks.append(Chunk(chunk_beg, chunk_end))
                if i_bin != METADATA_IBIN:
                    bins.append(Bin(i_bin, n_chunk, chunks))
            n_intv, = struct.unpack("i", f.read(4))
            intvs = []
            for _ in range(n_intv):
                ioffset, = struct.unpack("Q", f.read(8))
                intvs.append(Intv(ioffset))
            refs.append(Ref(n_bin, bins, n_intv, intvs))
        n_no_coor, = struct.unpack("Q", f.read(8))
        return Idx(magic, n_ref, refs, n_no_coor)
ADD COMMENT

Login before adding your answer.

Traffic: 1408 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6