I am trying to automate extracting data from an indexed BAM file and I am getting read errors:
[E::bgzf_read_block] Invalid BGZF header at offset 51686517456
[E::bgzf_read] Read block operation failed with error 2 after 0 of 4 bytes
This failure happens after about 12 hours of analysis, near the end of the 52GB BAM file:
$ ll HUDEP.control.DS182418.chr8.bam*
-rw-r--r-- 1 areynolds stamlab 51901713847 Jun 15 12:46 HUDEP.control.DS182418.chr8.bam
-rw-r--r-- 1 areynolds stamlab 7250712 Jun 15 12:59 HUDEP.control.DS182418.chr8.bam.bai
I ran samtools quickcheck
on it, which reports no issues:
$ samtools quickcheck -vvv HUDEP.control.DS182418.chr8.bam
verbosity set to 3
checking HUDEP.control.DS182418.chr8.bam
opened HUDEP.control.DS182418.chr8.bam
HUDEP.control.DS182418.chr8.bam is sequence data
HUDEP.control.DS182418.chr8.bam has 25 targets in header.
HUDEP.control.DS182418.chr8.bam has good EOF block.
I can otherwise read data from it with manual samtools
and pysam
commands to random locations within the BAM file.
$ python
Python 3.8.16 (default, Mar 2 2023, 03:21:46)
[GCC 11.2.0] :: Anaconda, Inc. on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import pysam
>>> print(pysam.__version__)
0.21.0
Remaking the BAM file might be an option but it is very time-consuming. I'd like to understand what went wrong, so I don't have to remake files again.
Is that the only option or are there other tests and fixes I could apply, before this?
https://github.com/samtools/htslib/pull/1676 patches a seek issue in
htslib
(upon whichsamtools
andpysam
depend)Indeed, though it is a seek issue when accessing files over HTTP and other network protocols. So, unless there's something you didn't mention and this question is not in fact about accessing files locally, that is unlikely to be the answer to this question.