Hello,
The task is as follows:
Given a local BAM index file (BAI) plus fasta reference and a remote BAM, produce selected region only samtools-readable BAM using curl with SSL & certificates. Minimize compression/decompression, just get the bgzf-compressed chunks containing region of interest (plus some other reads outside the region), perform the exact regions selection on the local drive.
The curl-less variant of it:
Using BAI + fasta reference, translate region (i.e. chr1:2000-4000) into binary positions of the corresponding BAM, save as text. Dump the BAM header + bgzf-compressed chunks creating samtools-readable micro-BAM, i.e. using dd.
There are multiple libraries/programs implementing necessary functions (samtools[C], htsjdk[java], sambamba/bioD[D], even BioGO[go]). My problem is that because I do understand enough of the BAM/BAI file specifications, I have problem understanding what exactly a given chunk of code does, especially that I am coming from the Python camp.
One possible shortcut is to use htsjdk and access the necessary classes/methods using either jython or groovy. In jython 2.7.0 after putting in Jython registry:
python.security.respectJavaAccessibility = false
#in the jython comandline:
import sys
sys.path.append("/usr/local/soft/picard_1.138/htsjdk-1.138.jar")
from htsjdk.samtools import *
from java.io import File
bai_fh = File("./A.bam.bai")
mydict = SAMSequenceDictionary()
bai_foo = DiskBasedBAMFileIndex(bai_fh, mydict)
At this stage I am able to use bai_foo.getNumberOfReferences()
etc., but the desired method getBinsOverlapping(int referenceIndex, int startPos, int endPos)
is in BrowseableBAMIndex
interface. And in this maze of classes, methods, interfaces I am getting lost.
So my questions are:
0. not explored but important: given that BAM file starts with compressed, mixed text and bin values header, I hope it is enough to simply dump everything from the BAM start until the start of the first alignment/unaligned sequences block. What is the correct method to get this number?
1. Given that htsjdk is accessible from Jython/groovy, what is the path to access the getBinsOverlapping?
2. Since I am not restricted to above libs/languages, any hack based on bioD or BioGO, Python or C etc. will be great
Many thanks for your help
Darek Kedra
Edit: The missing groovy part
groovy -cp libs/htsjdk-1.138.jar test_htsjdk.groovy
###### test_htsjdk.groovy ######
#!/usr/bin/env groovy
import htsjdk.samtools.*
File bam_fh = new File("./A.bam")
File bai_fh = new File("./A.bam.bai")
def mydict = new SAMSequenceDictionary()
def bai_foo = new DiskBasedBAMFileIndex(bai_fh, mydict)
println bai_foo.getNumberOfReferences()
The SAM/BAM spec gives C source code to compute bins overlapping a region. You don't need to call getBinsOverlapping() if it is hard. Just convert the 12-line C function to a Python function.
I remember doing exactly this, and the C function maps almost directly to Python. Thanks for the good spec.
I don't understand. You want `samtools view "http://host/remote.bam" chr123:145-6789` .
It's already implemented isn't it ?
@Pierre: yes. But so far not for encrypted connections requiring certificates.
As a test, can you give us any bam file behind a https server?
@Pierre: I am behind a firewall, so unfortunately no toy example easy to create for people from outside. To make things worse, the certs issued on my workstation are not exactly the same as ones certified by some cert authority. And the real killer is that the real data serving hub filters on IP numbers.
I will be happy to have a brain dead utility outputting just two positions inside the BAM file for curl (or dd) to fetch/dump. If
samtools view some_chunk.bam
will print out the all the reads without crashing I am fine. And because I have meta info about genome assembly version inside of every BAM I may need, getting some generic header just thatsamtools view
can work is not a problem, I think. Also because the regions of interest at this stage are SNPs/small indels, the size of a region will be 2x max Illumina lib insert size, meaning 1200bp, maybe plus some SD just in case I landed in a region where two reads from the pair are separated by more than the insert size.Having written above: I will try get an world-accessible https machine configured so it requires certificates. But I doubt it is something I can getup and running this evening.
EDIT spelling fix
FYI, from what I recall, the smallest chunk sizes you end up being able to query are 16kb. Granted, they'll be in a coordinate-sorted order within the BAM file, but you may end up querying more than you would otherwise expect (not to mention that every coordinate belongs to 6 bins).
Yes, I am aware that BAI index has 6 levels. With getLevelSize (from 0 to 4) I get 1,8,64,512,4096. But I assume that htsjdk/samtools etc. are able to figure out the relevant chunk(s) for a given genomic interval. And with getLevelForBin well, I should get the level, assuming I am able to instantiate relevant object with this method.