As mentioned, if you have a .tbi file, then the file it indexes (the same name without .tbi, ending in .gz) was probably made with bgzip.
You can use htslib to check whether the file is block-compressed, or use hexdump to check the first two bytes (the gzip magic number), the fourth byte (the flags byte, to see whether the extra-field bit is set) and the 13th and 14th bytes (the identifier of the first extra subfield, which bgzip sets to BC):
$ hexdump -s 0 -n 2 -e '8/1 "%02x""\n"' some_file.gz
1f8b
$ hexdump -s 3 -n 1 -e '8/1 "%d""\n"' some_file.gz | awk 'and($0,0x04){ print "extra header"; }'
extra header
$ hexdump -s 12 -n 2 -e '8/1 "%c""\n"' some_file.gz
BC
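If the first result equals 1f8b, the second result returns extra header, and the third equals BC, then some_file.gz was probably made with bgzip.
If the first result equals 1f8b and the second does not return extra header, then it is likely just a gzip file.
If the second result returns extra header but the third is not BC, then the file is most likely gzip with some other extra field, not bgzip.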
The htslib tool probably does a similar check of the first bytes of the input file to return a true or false identification.
(Bytes via: https://tools.ietf.org/html/rfc1952#page-5 and http://www.htslib.org/doc/bgzip.html)
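The three hexdump checks above can be folded into one small shell function. This is only a sketch: the name classify_gz and its four output labels are made up for illustration, it assumes a POSIX shell with od, cut and tr available, and it only looks at the first extra subfield (which is where bgzip writes BC), so it is not a full BGZF validator.

```shell
# Hypothetical helper: classify a file from its first 14 bytes.
# Prints one of: bgzip, gzip, gzip-with-extra-field, not-gzip
classify_gz() (
    # First 14 bytes as one lowercase hex string, e.g. 1f8b0804...4243
    hdr=$(od -An -tx1 -N14 "$1" | tr -d '[:space:]')
    magic=$(printf '%s' "$hdr" | cut -c1-4)     # bytes 1-2: gzip magic number
    flags=$(printf '%s' "$hdr" | cut -c7-8)     # byte 4: flags (FLG)
    subid=$(printf '%s' "$hdr" | cut -c25-28)   # bytes 13-14: first extra subfield ID

    if [ "$magic" != "1f8b" ]; then
        echo "not-gzip"
    elif [ $(( 0x${flags:-00} & 4 )) -eq 0 ]; then
        echo "gzip"                             # extra-field bit (0x04) not set
    elif [ "$subid" = "4243" ]; then
        echo "bgzip"                            # 0x42 0x43 is "BC"
    else
        echo "gzip-with-extra-field"            # extra field present, but not BC
    fi
)
```

Calling classify_gz some_file.gz prints one label. A file that has the extra-field bit set but lacks the BC identifier comes out as gzip-with-extra-field, meaning it would still need to be recompressed with bgzip.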
If you have a directory of files that are all gzip-formatted and you want to make block-compressed versions, and you are using a bash shell, then you could use a for loop, using the .bgz convention:
$ for in_fn in *.vcf.gz; do out_fn="${in_fn%.*}.bgz"; echo "${out_fn}"; gunzip -c "${in_fn}" | bgzip > "${out_fn}"; tabix -p vcf "${out_fn}"; done
If you want to follow the .gz convention, you have to do some extra work:
$ for in_fn in *.vcf.gz; do tmp_fn="${in_fn%.*}.tmp.gz"; echo "${tmp_fn}"; gunzip -c "${in_fn}" | bgzip > "${tmp_fn}" && mv "${tmp_fn}" "${in_fn}" && tabix -p vcf "${in_fn}"; done
Note: This second loop is dangerous, as it will overwrite the original gzip file. I would recommend having backups, writing to a separate directory, or just using the .bgz convention.
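Following the note above, here is a sketch of the same conversion that writes the block-compressed copies into a separate directory, so the original gzip files are never touched. The function name recompress_dir and the BGZIP and TABIX override variables are my own additions (the overrides exist only so the loop can be tried out without htslib installed); otherwise it assumes bgzip and tabix are on the PATH.

```shell
# Hypothetical helper: recompress every *.vcf.gz in SRC into DST, keeping the .gz names.
# Usage: recompress_dir SRC DST
recompress_dir() (
    src=$1
    dst=$2
    bgzip_cmd=${BGZIP:-bgzip}     # assumption: override hook, defaults to bgzip
    tabix_cmd=${TABIX:-tabix}     # assumption: override hook, defaults to tabix
    mkdir -p "$dst" || exit 1
    for in_fn in "$src"/*.vcf.gz; do
        [ -e "$in_fn" ] || continue                # the glob matched nothing
        out_fn=$dst/$(basename "$in_fn")
        # The pipeline's exit status is that of the compressor (last command)
        gunzip -c "$in_fn" | $bgzip_cmd > "$out_fn" || exit 1
        $tabix_cmd -p vcf "$out_fn" || exit 1      # index the new copy
        echo "$out_fn"                             # report each file written
    done
)
```

For example, recompress_dir . ./bgzipped would leave the originals in place and put the indexed copies under ./bgzipped.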
If you already have the tbi files, your VCFs are already compressed with BGZip; just check with file *vcf.gz. If that is the case, just rename the file.

To add to JC's point, the difference between the file output for a gzip file versus a bgzip file will be that for the latter, it will mention the presence of an extra field.

Beyond that, it is actually unusual to use the bgz suffix. Many tools requiring bgzip-compressed data (to my knowledge) actually expect the normal gz suffix. If tabix indexing works then it is bgzip, otherwise it throws an error.

I'm not sure if it is unusual. I have used bgz to hint that the file is likely indexed with tabix or similar. I have seen others use this convention.

gnomAD uses it (hover over the VCF file links here: https://gnomad.broadinstitute.org/downloads - all .vcf.bgz). It's not unusual, but not necessary either. Anything that can work with a gzipped file can work with a bgzipped file, and tools that need a bgzipped file should be equipped to error out if the file is not bgzipped. If it's just the extension that's causing OP's problem, they can rename or create soft-links.

Good to know, have not seen it myself so far.
@JC @RamRS - Thanks a lot for your suggestions. However, I have an issue. When I execute the command file *vcf.gz, I get output like Test_t5.chr12.dose.vcf.gz: gzip compressed data, extra field. So, I think all my files are bgzip files. Am I right? So, I renamed the file extension from .gz to .bgz and used hail to import the vcf.bgz file. However, I got an error message which states that file does not conform to block zip format. May I know why this happens despite the file having a .bgz extension? Can I kindly request your help, please?

Your files are not BGZip, they are GZip; you will need to recompress them.

@JC, may I know how you can tell that my files aren't BGzip? The command output mentions the extra field. Am I right? The command produces output like that below. Can you also let us know what the command output should look like if it's a BGzip file (because it already looks like what @RamRS and @Alex Reynolds (based on hexdump) mentioned)?
Here is how it should look: