bgzip files are backward compatible with gzip, but I have issues when using bgzip-compressed VCF files with snpEff (Java) or with Perl scripts that use IO::Uncompress::Gunzip (which I believe uses zlib under the hood). In both cases the data is decompressed but truncated after a few hundred lines. I could be totally wrong, but I wonder whether zlib (or whatever gzip-compatible library they use) is getting confused by the bgzip blocks and processing only one or a few of them, leaving the output incomplete.
Perl code that does not work:
#!/usr/bin/env perl
use strict;
use warnings;
use IO::Uncompress::Gunzip qw(gunzip $GunzipError);
my $infile = shift;
# Open the compressed file with the module's default options
my $infh = IO::Uncompress::Gunzip->new( $infile ) or die "IO::Uncompress::Gunzip failed: $GunzipError\n";
my $line_count = 0;
# Count every line the decompressor returns
while (my $line = <$infh>) {
    $line_count++;
}
print "total lines read = $line_count\n";
This gives 419 lines
$ perl /home/pmg/tmp/test_zlib-bgzip.pl 460112_TTAGGC_L005_L006_C3HVJACXX.sorted.rmdup.varsit.vcf.gz
total lines read = 419
But using open with a gzip pipe works:
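#!/usr/bin/env perl
use strict;
use warnings;
my $infile = shift;
# Read through an external gzip process (bgzip can be used instead of gzip)
open(my $infh, '-|', 'gzip', '-dc', $infile) or die "Cannot run gzip on $infile: $!\n";
my $line_count = 0;
while (my $line = <$infh>) {
    $line_count++;
}
close($infh);
print "total lines read = $line_count\n";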
This gives the expected number of lines:
$ perl /home/pmg/tmp/test_gzip-bgzip.pl 460112_TTAGGC_L005_L006_C3HVJACXX.sorted.rmdup.varsit.vcf.gz
total lines read = 652829
I googled this and could not quickly find any relevant entry, but I am sure other people have already faced it. Does anyone have a clue about why this is happening? I am using Ubuntu 12.04.4 with Perl 5.16.
Thanks Heng. I have read the SAM specs many times, even the bgzip part, but I forgot the footnote where it explains the kind of bug that has caught me twice this week using snpEff and VEP: "[2] It is worth noting that there is a known bug in the Java GZIPInputStream class that concatenated gzip archives cannot be successfully decompressed by this class. BGZF files can be created and manipulated using the built-in Java util.zip package, but naive use of GZIPInputStream on a BGZF file will not work due to this bug."
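For the Perl side, here is a minimal sketch, assuming the same concatenated-member behaviour is what truncates the IO::Uncompress::Gunzip output. The module documents a MultiStream option (off by default) that makes it read all gzip members as a single stream instead of stopping after the first one. The count_lines name is only for illustration, and this should be treated as untested against the VCF above.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use IO::Uncompress::Gunzip qw($GunzipError);

# Count lines in a gzip or BGZF file, reading every gzip member.
# (count_lines is a hypothetical helper name, not part of any library.)
sub count_lines {
    my ($infile) = @_;
    # MultiStream => 1 asks the module to continue past the end of the
    # first gzip member and treat all members as one continuous stream.
    my $infh = IO::Uncompress::Gunzip->new( $infile, MultiStream => 1 )
        or die "IO::Uncompress::Gunzip failed: $GunzipError\n";
    my $line_count = 0;
    while (my $line = <$infh>) {
        $line_count++;
    }
    $infh->close();
    return $line_count;
}

# Run as a script only when a file name is given on the command line.
if (@ARGV) {
    my $infile = shift @ARGV;
    print "total lines read = ", count_lines($infile), "\n";
}
```

It would be called the same way as the scripts above, with the .vcf.gz path as the only argument.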
Thanks for posting this, it will surely save others some time in the future. (The quote is from the bottom of page 12 in the SAM specs, in case anyone is wanting a reference).