BACKGROUND
I have a dockerized app running IGV.js and serving files from the backend. For ordinary alignment files from shotgun sequencing (Illumina etc.), I can see that my server correctly responds to the range requests sent by IGV:
Content-Range: bytes 1969716952-1969767576/2396323876
Content-Type: application/octet-stream
This works fine, and my server responds with a 206 HTTP status code (Partial Content) for almost all bam-file/fasta-ref pairs. The resulting payloads are always around 12-50 kB. This is the expected behavior, as it saves IGV from having to download the whole bam/fasta.
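To make the size of a partial response easy to check, here is a minimal sketch that parses a Content-Range value and reports how many bytes were served and what fraction of the file that is. The function name content_range_fraction is hypothetical, and the sketch only handles the simple "bytes first-last/total" form shown above.

```python
import re


def content_range_fraction(header_value: str) -> tuple:
    """Return (bytes_served, fraction_of_file) for a Content-Range value.

    Only the "bytes first-last/total" form is handled; anything else
    (for example an unknown total written as "*") raises ValueError.
    """
    match = re.fullmatch(r"bytes (\d+)-(\d+)/(\d+)", header_value.strip())
    if match is None:
        raise ValueError(f"unsupported Content-Range value: {header_value!r}")
    first, last, total = (int(group) for group in match.groups())
    served = last - first + 1  # byte ranges are inclusive on both ends
    return served, served / total


if __name__ == "__main__":
    served, fraction = content_range_fraction(
        "bytes 1969716952-1969767576/2396323876"
    )
    print(f"{served} bytes served, {fraction:.6%} of the file")
```

For the header above this gives 50625 bytes, a tiny fraction of the 2.4 GB file. Applied to a response such as bytes 270-7327614/7612045, the same function reports roughly 96% of the file.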
ISSUE
Given: A reference file of around 4 kB (reverse-transcribed RNA) with hundreds of reads (very high depth).
In certain cases where the bam files come from Oxford Nanopore data, IGV.js requests the entire bam file, whether the bam file is 10 MB or 700 MB. The range request from the browser (initiated by IGV.js) looks like this:
Range: bytes=270-7327614
And of course the server has no choice but to send almost the entire file:
Content-Range: bytes 270-7327614/7612045
This basically requests the entire file and causes the browser to freeze when the bam file is 700 MB, until the entire bam has been read into memory. It was difficult to understand at first why this happens; then I inspected the index (.bai) files and realized that, regardless of bam file size, the generated *.bai file is always 96 bytes. This does not occur with normal bam/fasta pairs, where the *.bai files are generally around 1-2 MB.
I am guessing that the above is the reason why IGV.js requests the entire file for ONT datasets. Why are these *.bai files always 96 bytes, and how can I fix it?
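For anyone who wants to look inside such an index, here is a minimal sketch that reads the raw bytes of a .bai file and counts the bins, chunks, and linear-index intervals per reference, following the BAI layout in the SAM/BAM specification. The function name summarize_bai is hypothetical. A 96-byte file would be consistent with a single reference holding one regular bin, the metadata pseudo-bin, and one linear-index interval, but that is an assumption about these particular files, not something verified against them.

```python
import struct


def summarize_bai(data: bytes) -> list:
    """Return one dict per reference with its bins and interval count.

    Each dict looks like {"bins": {bin_id: n_chunks}, "n_intv": int}.
    The optional trailing count of unplaced reads is ignored.
    """
    if data[:4] != b"BAI\x01":
        raise ValueError("not a BAI index (bad magic bytes)")
    (n_ref,) = struct.unpack_from("<i", data, 4)
    offset = 8
    summary = []
    for _ in range(n_ref):
        (n_bin,) = struct.unpack_from("<i", data, offset)
        offset += 4
        bins = {}
        for _ in range(n_bin):
            bin_id, n_chunk = struct.unpack_from("<Ii", data, offset)
            bins[bin_id] = n_chunk
            # Skip the header just read plus n_chunk (begin, end) uint64 pairs.
            offset += 8 + 16 * n_chunk
        (n_intv,) = struct.unpack_from("<i", data, offset)
        # Skip the count plus n_intv uint64 linear-index offsets.
        offset += 4 + 8 * n_intv
        summary.append({"bins": bins, "n_intv": n_intv})
    return summary
```

Usage would be along the lines of summarize_bai(open("sample.mapped.sorted.bam.bai", "rb").read()), where the file name is only a placeholder. If the result shows a single regular bin with one chunk and a single interval, every region query maps to the same byte range of the bam.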
WHAT HAS BEEN TRIED
Setting the visibilityWindow option for IGV.js to 50 base pairs, so that IGV does not request anything until the viewport is small enough, did not work. Even at 50 bp, flanking regions are not requested, which means that even if you scroll slightly to the left or right, IGV re-requests the entire file every time.
(Below are pseudo descriptions, I did not run them literally as they are in bash)
SAMTools Mappings Sorter .mapped.bam > .mapped.sorted.bam
SAMTools Mappings Indexer .mapped.sorted.bam > mapped.sorted.bam.bai --> 96bytes!
Minimap2 Aligner for Long Reads .fastq.gz > .mapped.bam + .mmi + .mapped.bam.bai --> 96bytes!
In both cases above, the generated bai files were always 96 bytes. The index size did not vary with the bam file size.
ADDITIONAL INFO
I also posted this question on GitHub as an issue.
DISCLAIMER
I do not own the data, and I am not authorized to share it.
Many thanks for the explanation! Before I accept, I would like to ask a few more things to clarify the situation for myself. So as far as I understand:
In this case, I had 7-10 MB bam files rather than the 700 MB bam file I described in the question. Even with these small bam files, IGV requested the entire file over and over again whenever I scrolled left or right. From your explanation, I understand that there is no workaround.
Is there a platform for Oxford Nanopore where I can post the issue, so that they can get together with the IGV developers and develop a new format, scheme, etc. to prevent this from happening? Requesting files over and over again, even small ones, is very inefficient.
You can split the bam easily after alignment (sort by read name, chop it up, sort by position).
You can post the issue in the appropriate IGV repository: https://github.com/igvteam and be sure to specify that your issue is a side-case where the reference length is smaller than the .bai interval size. My guess is that there may be an edge case when, if there is only one "interval" in the index, then even if all the reads are held in memory, any kind of scrolling re-triggers a request when it otherwise wouldn't.
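The point about the reference being shorter than the index interval can be illustrated with the binning function from the SAM/BAM specification, where the smallest bins and the linear-index windows both span 16384 bp. The sketch below is a Python transcription of the specification's reg2bin; the helper n_linear_windows is a hypothetical name added for illustration.

```python
def reg2bin(beg: int, end: int) -> int:
    """Bin number for a zero-based, half-open interval [beg, end)."""
    end -= 1
    if beg >> 14 == end >> 14:
        return ((1 << 15) - 1) // 7 + (beg >> 14)  # 16 kbp bins start at 4681
    if beg >> 17 == end >> 17:
        return ((1 << 12) - 1) // 7 + (beg >> 17)  # 128 kbp bins start at 585
    if beg >> 20 == end >> 20:
        return ((1 << 9) - 1) // 7 + (beg >> 20)   # 1 Mbp bins start at 73
    if beg >> 23 == end >> 23:
        return ((1 << 6) - 1) // 7 + (beg >> 23)   # 8 Mbp bins start at 9
    if beg >> 26 == end >> 26:
        return ((1 << 3) - 1) // 7 + (beg >> 26)   # 64 Mbp bins start at 1
    return 0


def n_linear_windows(ref_len: int) -> int:
    """Number of 16384 bp linear-index windows covering a reference."""
    return ((ref_len - 1) >> 14) + 1


if __name__ == "__main__":
    # Every read on a roughly 4 kb reference lands in the same bin and window.
    print(reg2bin(0, 4000), reg2bin(1200, 3900), n_linear_windows(4000))
```

On a reference of about 4 kb, every alignment falls into bin 4681 and the linear index has a single window, so the index cannot distinguish one region of the reference from another and any query resolves to the same span of the file.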