parsing BAM index (quick&dirty samtools view X.bam region)
2
1
Entering edit mode
9.3 years ago
Darked89 4.7k

Hello,

The task is as follows:

Given a local BAM index file (BAI) plus fasta reference and a remote BAM, produce selected region only samtools-readable BAM using curl with SSL & certificates. Minimize compression/decompression, just get the bgzf-compressed chunks containing region of interest (plus some other reads outside the region), perform the exact regions selection on the local drive.

The curl-less variant of it:

Using BAI + fasta reference, translate region (i.e. chr1:2000-4000) into binary positions of the corresponding BAM, save as text. Dump the BAM header + bgzf-compressed chunks creating samtools-readable micro-BAM, i.e. using dd.

There are multiple libraries/programs implementing necessary functions (samtools[C], htsjdk[java], sambamba/bioD[D], even BioGO[go]). My problem is that because I do understand enough of the BAM/BAI file specifications, I have problem understanding what exactly a given chunk of code does, especially that I am coming from the Python camp.

One possible shortcut is to use htsjdk and access the necessary classes/methods using either jython or groovy. In jython 2.7.0 after putting in Jython registry:

python.security.respectJavaAccessibility = false
#in the jython comandline:
import sys
sys.path.append("/usr/local/soft/picard_1.138/htsjdk-1.138.jar")
from htsjdk.samtools import *
from java.io import File
bai_fh = File("./A.bam.bai")
mydict = SAMSequenceDictionary()
bai_foo = DiskBasedBAMFileIndex(bai_fh, mydict)

At this stage I am able to use bai_foo.getNumberOfReferences() etc., but the desired method getBinsOverlapping(int referenceIndex, int startPos, int endPos) is in BrowseableBAMIndex interface. And in this maze of classes, methods, interfaces I am getting lost.

So my questions are:

0. not explored but important: given that BAM file starts with compressed, mixed text and bin values header, I hope it is enough to simply dump everything from the BAM start until the start of the first alignment/unaligned sequences block. What is the correct method to get this number?

1. Given that htsjdk is accessible from Jython/groovy, what is the path to access the getBinsOverlapping?

2. Since I am not restricted to above libs/languages, any hack based on bioD or BioGO, Python or C etc. will be great

Many thanks for your help

Darek Kedra

Edit: The missing groovy part

groovy -cp libs/htsjdk-1.138.jar test_htsjdk.groovy
###### test_htsjdk.groovy ######
#!/usr/bin/env groovy

import htsjdk.samtools.*

File bam_fh =  new File("./A.bam")
File bai_fh =  new File("./A.bam.bai")

def mydict = new SAMSequenceDictionary()
def bai_foo = new DiskBasedBAMFileIndex(bai_fh, mydict)
println bai_foo.getNumberOfReferences()
jython BAM picard htsjdk samtools • 5.3k views
ADD COMMENT
2
Entering edit mode

The SAM/BAM spec gives C source code to compute bins overlapping a region. You don't need to call getBinsOverlapping() if it is hard. Just convert the 12-line C function to a Python function.

ADD REPLY
0
Entering edit mode

I remember doing exactly this, and the C function maps almost directly to Python. Thanks for the good spec.

ADD REPLY
1
Entering edit mode

I don't understand. You want `samtools view "http://host/remote.bam" chr123:145-6789` .

It's already implemented isn't it ?

ADD REPLY
0
Entering edit mode

@Pierre: yes. But so far not for encrypted connections requiring certificates.

ADD REPLY
0
Entering edit mode

As a test, can you give us any bam file behind a https server?

ADD REPLY
0
Entering edit mode

@Pierre: I am behind a firewall, so unfortunately no toy example easy to create for people from outside. To make things worse, the certs issued on my workstation are not exactly the same as ones certified by some cert authority. And the real killer is that the real data serving hub filters on IP numbers.

I will be happy to have a brain dead utility outputting just two positions inside the BAM file for curl (or dd) to fetch/dump. If samtools view some_chunk.bam will print out the all the reads without crashing I am fine. And because I have meta info about genome assembly version inside of every BAM I may need, getting some generic header just that samtools view can work is not a problem, I think. Also because the regions of interest at this stage are SNPs/small indels, the size of a region will be 2x max Illumina lib insert size, meaning 1200bp, maybe plus some SD just in case I landed in a region where two reads from the pair are separated by more than the insert size.

Having written above: I will try get an world-accessible https machine configured so it requires certificates. But I doubt it is something I can getup and running this evening.

EDIT spelling fix

ADD REPLY
1
Entering edit mode

FYI, from what I recall, the smallest chunk sizes you end up being able to query are 16kb. Granted, they'll be in a coordinate-sorted order within the BAM file, but you may end up querying more than you would otherwise expect (not to mention that every coordinate belongs to 6 bins).

ADD REPLY
0
Entering edit mode

Yes, I am aware that BAI index has 6 levels. With getLevelSize (from 0 to 4) I get 1,8,64,512,4096. But I assume that htsjdk/samtools etc. are able to figure out the relevant chunk(s) for a given genomic interval. And with getLevelForBin well, I should get the level, assuming I am able to instantiate relevant object with this method.

ADD REPLY
1
Entering edit mode
9.3 years ago

The next version of htslib/samtools is supposed to use libcurl if it's available, so just try the version from github. That's supposed to allow https, sftp, etc. connections.

Edit: I think it's its own branch and hasn't gotten merged into devel yet. Have a look here for starters.

ADD COMMENT
1
Entering edit mode

@Devon: Thank you. Heng Li wrote not long time ago that the new version of htslib/samtools is around the corner. So hopefully for the remote access problem is solved for the above protocols. Still, if with a helping hand I will be able to play with i.e. htsjdk/jython, the skills gained will be greater than when I manage to compile some devel branch of htslib. Especially that I have found hardly anything about using htsjdk from jython or groovy. Therefore even if the "getting remote BAM chunk with a cert" with new samtools is the most sensible solution, being able to take road apparently hardly ever traveled makes sense for me, if time-costs of doing it are not too high.

PS I will check the github branches you mentioned.

ADD REPLY
0
Entering edit mode
9.2 years ago
Darked89 4.7k

Here is the Python version, hopefully doing what the C version does. I will really appreciate if you somebody can look at this and confirm it. Apart for SAM format PDF I did look at: https://github.com/lh3/samtools-legacy/blob/master/bam_index.c

#!/usr/bin/env python

BAM_MAX_BIN   = 37450      # =(8^6-1)/7+1
max_chr_size  = 536870912  # 1<<29 #

regbin_dict = {} #dict instead of list

def reg2bins(beg, end, regbin_dict):
    I = 0
    k = 0
    if (beg >= end): return 0
    if (end >= max_chr_size): end = max_chr_size
    end = end - 1
    regbin_dict[i+1] = 0
    #put +1 as upper limit because in C: k <=
    for k in range(1 + (beg>>26),       1 + (end>>26) +1):
        #in C: list[i++] = k
        I += 1
        regbin_dict[i] = k
    for k in range(9 + (beg>>23),       9 + (end>>23) +1):
        I += 1 
        regbin_dict[i] = k
    for k in range(73 + (beg>>20),     73 + (end>>20) +1):
        I += 1 
        regbin_dict[i] = k
    for k in range(585 + (beg>>17),   585 + (end>>17) +1):
        I += 1 
        regbin_dict[i] = k
    for k in range(4681 + (beg>>14), 4681 + (end>>14) +1):
        I += 1
        regbin_dict[i] = k

    print regbin_dict #for testing
    return i

#testing
for beg in range(1, 400000000, 4000000):
    regbin_dict = {} 
    end = beg + 120000
    print beg, end
    n_bins = reg2bins(beg, end, regbin_dict)
    print n_bins
ADD COMMENT
0
Entering edit mode

At this stage the path to follow seems to be:

  1. Get the sequence number (seq_no) either from fasta reference or BAM header to be sure
  2. Get the BAI info as dictionary corresponding to seq_no
  3. Select the 16kb level bins(s) with corresponding chunks, get the binary positions
  4. Dump them separately, find the way to read them
ADD REPLY
0
Entering edit mode

I have compared the Bins for the MT sequence, 16571bp long according to BAM header. So the good news are that from the above script I am getting:

{1: 1, 2: 9, 3: 73, 4: 585, 5: 4681, 6: 4682}

DiskBasedBAMFileIndex.regionToBins(1, 16571) also gives: {0, 1, 9, 73, 585, 4681, 4682}

But then the real Bin numbers from BAI file are: 585, 4681, 37450 (checked with bamUtil and a small D program based on BioD). Not a big deal, since I can intersect Bin numbers from the above script with the ones from BAI. Except for 37450, but then all 3 Bins map to just two LinearIndex Chunks: [224041363/37218, 224279600/28259] (BioD based output) / 0xd5a99939162 & 0xd5e3c306e63 (bamUtil dumpIndex output). The size of BAM is 227675281, so the numbers make sense. Still some work to be done.

ADD REPLY
0
Entering edit mode

Here is a nice version from bamnostic.

ADD REPLY

Login before adding your answer.

Traffic: 1303 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6