Question

Convert multiple vcf.gz files to vcf.bgz files

0

Entering edit mode

4.4 years ago

selvasathappan36 • 0

Hello Everyone,

I have files like as shown below

Test_t5.chr12.info.gz 
Test_t5.chr12.dose.vcf.gz.tbi
Test_t5.chr12.dose.vcf.gz
..
..
..
Test_t5.chr11.info.gz 
Test_t5.chr11.dose.vcf.gz.tbi
Test_t5.chr11.dose.vcf.gz

I would like to convert all my vcf.gz files to vcf.bgz files

Can you let me know how can I do this?

I have the below command from stack overflow. Is the below right way to do it or should I use tabix. If yes, can you share on how can it be done using tabix?

gunzip -c file.vcf.gz | bgzip  > file.vcf.bgz

But how can I do this for all my vcf.gz files instead of me running this for all files manually.?

I expect my output to be like as shown below

Test_t5.chr12.info.gz 
Test_t5.chr12.dose.vcf.gz.tbi
Test_t5.chr12.dose.vcf.bgz            # see change here
..
..
..
Test_t5.chr11.info.gz 
Test_t5.chr11.dose.vcf.gz.tbi
Test_t5.chr11.dose.vcf.bgz          #see change here

genome sequencing RNA-Seq gene next-gen • 4.8k views

ADD COMMENT • link updated 4.4 years ago by JC 13k • written 4.4 years ago by selvasathappan36 • 0

1

Entering edit mode

If you already have the tbi files, your VCFs are already compressed with BGZip, just check with file *vcf.gz. If that is the case, just rename the file.

ADD REPLY • link 4.4 years ago by JC 13k

1

Entering edit mode

To add to JC's point, the difference between the file output for a gzip file versus a bgzip file will be that for the latter, it will mention the presence of an extra field.

ADD REPLY • link 4.4 years ago by Ram 44k

1

Entering edit mode

Beyond that, it is actually unusual to use bgz suffix. Many tools requiring bgzip-compressed data (to my knowledge) actually expect the normal gz suffix. If tabix indexing works then it is bgzip, otherwise it throws an error.

ADD REPLY • link 4.4 years ago by ATpoint 86k

2

Entering edit mode

I'm not sure if it is unusual. I have used bgz to hint that the file is likely indexed with tabix or similar. I have seen others use this convention.

ADD REPLY • link 4.4 years ago by Alex Reynolds 36k

2

Entering edit mode

gnomAD uses it (Hover over the VCF file links here: https://gnomad.broadinstitute.org/downloads - all .vcf.bgz). It's not unusual, but not necessary either. Anything that can work with a gzipped file can work with a bgzipped file, and tools that need a bgzipped file should be equipped to error out if the file is not bgzipped. If it's just the extension that's causing OP's problem, they can rename or create soft-links.

ADD REPLY • link 4.4 years ago by Ram 44k

1

Entering edit mode

Good to know, have not seen it myself so far.

ADD REPLY • link 4.4 years ago by ATpoint 86k

0

Entering edit mode

@JC @RamRS - Thanks a lot for your suggestions. However, I have an issue. When I execute the command file *vcf.gz, I get the output like Test_t5.chr12.dose.vcf.gz: gzip compressed data, extra field. So, I think all my files are bgzip file. Am I right? So, I renamed the file extension from .gz to .bgz and used hail to import the vcf.bgz file. However, I got an error message which states that file does not conform to block zip format. May I know why does this happen despite it being in a .bgz format. Can I kindly request your help please

ADD REPLY • link 4.4 years ago by selvasathappan36 • 0

0

Entering edit mode

your files are not BGzip, are GZip, you will need to recompress them

ADD REPLY • link 4.4 years ago by JC 13k

0

Entering edit mode

@JC, May I know how do you say that my files aren't BGzip. The command output has mention of extra head field. Am I right?. The command produces output which is like as below

Test_t5.chr12.dose.vcf.gz: gzip compressed data, extra field

Can you also let us know how the command output should look like if its a BGzip file (because it already looks like what @RamRS and @Alex Reynolds (based on hexdump) mentioned).

ADD REPLY • link 4.4 years ago by selvasathappan36 • 0

1

Entering edit mode

Here is how should look like:

$ file test_chr22.vcf.gz   # this is BGZip file
test_chr22.vcf.gz: Blocked GNU Zip Format (BGZF; gzip compatible), block length 9518
$ gunzip -c test_chr22.vcf.gz | gzip > test_chr22_GZIP.vcf.gz
$ file test_chr22_GZIP.vcf.gz # this is a GZip file
test_chr22_GZIP.vcf.gz: gzip compressed data, from Unix, original size modulo 2^32 174414528 gzip compressed data, unknown method, ASCII, has CRC, extra field, has comment, from FAT filesystem (MS-DOS, OS/2, NT), original size modulo 2^32 174414528

ADD REPLY • link 4.4 years ago by JC 13k

score 1 · Answer 1 · 2020-09-10

1

Entering edit mode

4.4 years ago

Alex Reynolds 36k

As mentioned, if you have a .tbi file, then the prefix of that file with .gz probably means it was made with bgzip.

You can use htslib to check if the file is block-compressed, or use hexdump to check the first two bytes (for the file type), the fourth byte (to see if there is an extra header set) and the 13th and 14th bytes (to check for the extra header):

$ hexdump -s 0 -n 2 -e '8/1 "%02x""\n"' some_file.gz
1f8b
$ hexdump -s 3 -n 1 -e '8/1 "%d""\n"' some_file.gz | awk 'and($0,0x04){ print "extra header"; }'
extra header
$ hexdump -s 12 -n 2 -e '8/1 "%c""\n"' some_file.gz
BC

If the first result equals 1f8b, the second result returns extra header, and the third equals BC, then some_file.gz was probably made with bgzip.

If the first result equals 1f8b and the second does not return extra header, then it is likely just a gzip file.

The htslib tool probably does some similar check of the first bytes in the input file, to return a true or false identification.

(Bytes via: https://tools.ietf.org/html/rfc1952#page-5 and http://www.htslib.org/doc/bgzip.html)

If you have a directory of files that are all gzip-formatted and you want to make block-compressed versions, and you are using a bash shell, then you could use a for loop, using the .bgz convention:

$ for in_fn in `ls *.vcf.gz`; do out_fn=${in_fn%.*}.bgz; echo ${out_fn}; gunzip -c ${in_fn} | bgzip > ${out_fn}; tabix -p vcf ${out_fn}; done

If you want to follow the .gz convention, you have to do some extra work:

$ for in_fn in `ls *.vcf.gz`; do tmp_fn=${in_fn%.*}.tmp.gz; echo ${tmp_fn}; gunzip -c ${in_fn} | bgzip > ${tmp_fn}; mv ${tmp_fn} ${in_fn}; tabix -p vcf ${in_fn}; rm ${tmp_fn}; done

Note: This second loop is dangerous, as it will overwrite the original gzip file. I would recommend having backups, writing to a separate directory, or just using the .bgz convention.

ADD COMMENT • link 4.4 years ago by Alex Reynolds 36k

1

Entering edit mode

Is the hexdump check more accurate than a file check?

file some_file.bgz 
some_file.bgz:  gzip compressed data, extra field

file some_other_file.tar.gz
some_other_file.tar.gz: gzip compressed data, from Unix, last modified: Thu Sep 10 12:29:54 2020

ADD REPLY • link 4.4 years ago by Ram 44k

1

Entering edit mode

The hexdump check could check if the extra header contains the block-compressed key (BC). I'm just showing how this tool (or similar) could be used to investigate the file header.

ADD REPLY • link 4.4 years ago by Alex Reynolds 36k

0

Entering edit mode

@Alex Reynolds - Thanks for your answer. Upvoted. Yes, I did try your hexdump solution and one of my file produced the same output as shown in your answer. But your command to convert gz to bgz file runs for a long time and doesn't provide any output after priniting one file name. Its been more than 30 minutes. May I check whether that usually takes time? Am not sure whether you have used hail tool. When I try to read my file using hail - import vcf command, it throws an error that file does not conform to block zip format. have reached out to hail people anyway. May I also know how your command can be used to check for all VCF files at once instead of manually changing filenames one by one

ADD REPLY • link 4.4 years ago by selvasathappan36 • 0

0

Entering edit mode

You could do something like this to check all files:

$ for in_fn in `ls *.vcf.gz`; do GZIP_MAGIC_BYTES=`hexdump -s 0 -n 2 -e '8/1 "%02x""\n"' ${in_fn}`; EXTRA_HEADER=`hexdump -s 3 -n 1 -e '8/1 "%d""\n"' ${in_fn} | awk '{ if (and($0, 0x04)) { print "extra header" } else { print "no extra header" }}'`; BC_KEY=`hexdump -s 12 -n 2 -e '8/1 "%c""\n"' ${in_fn}`; echo -e "${in_fn}\t${GZIP_MAGIC_BYTES}\t${EXTRA_HEADER}\t${BC_KEY}" >> report.txt; done

This creates a file called report.txt that you can browse through with a text editor or spreadsheet tool. The first column of report.txt will be the filename, the remaining columns will be the output of hexdump for each of the three tests.

As far as extraction and recompression times, VCF files can be very large, so it would not be unusual for that to take some time. How large is a typical VCF file on your end?

If you have access to a computational cluster, this work could be parallelized, to speed up conversion, but first check if you really need to do any recompression work, at all.

ADD REPLY • link 4.4 years ago by Alex Reynolds 36k

score 1 · Answer 2 · 2020-09-11

1

Entering edit mode

4.4 years ago

JC 13k

After seeing this is required here is how you can use that

for VCF in *.vcf.gz
do
    gunzip -c ${VCF} | bgzip  > ${VCF%gz}bgz
    tabix ${VCF%gz}bgz
done

ADD COMMENT • link 4.4 years ago by JC 13k

0

Entering edit mode

upvoted for the response

ADD REPLY • link 4.4 years ago by selvasathappan36 • 0

0

Entering edit mode

May I know how long does this take to convert to gzip to bgzip file? I executed the command but its been running for more than 15 mins. I am fine with waiting but to understand whether this usually takes time or something issue with my system./file. As of now, it output only one file name (with new extension). I think your code will pick all files with vcf.gz format adn convert it to vcf.bgz

ADD REPLY • link 4.4 years ago by selvasathappan36 • 0

0

Entering edit mode

that depends on how many files you have, how big each one, how much computing power you have, and how fast are you I/O. You can check progress checking the new files created, with ls -l *bgz*

ADD REPLY • link 4.4 years ago by JC 13k