BGZF file looks weird
3.4 years ago
HL ▴ 10

I have a big file that htsfile reports as VCF version 4.2 BGZF-compressed variant calling data. The problem is that the file looks really weird: it seems to contain only random numbers. I didn't compress this file myself and I don't know who did, so I don't know exactly how it was made. If I take, for example, the first 4000 lines of the big file and try to decompress or view them, I get the error: unexpected end of file. When I compress some other file to BGZF format, I can still see separate columns for CHROM, POS, ID and so on. This is the weird-looking big file:

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  10022-20416-17  10024-34469-18A 10025-34469-18B 10034-31625-18A 10035-31625-18B 10036-31625-18C 10042-29083-18  10044-34485-18A 10045-34485-18B 10046-34485-18C 10069-33802-18  10070-20895-17  10072-20901-17  10074-20904-17  10080-20908-17  10109-34224-18  1011-22957-18   10118   10123-20960-17  10156-20985-17A 10157-20985-17B 10158-20985-17C 10177-20994-17  10197-34251-18  10226-20274-17  10234-34382-18  10239-21076-17  10241-21077-17  10241-34234-18  10242-33545-18  10246-34072-18  10252-20459-17A 10253-20459-17B 10254-20459-17C 10260-21090-17A 10261-21090-17B 10262-21090-17C 10264-34749-18A 10265-34749-18B 10266-34749-18C

And at some point it looks like this:

1884-24019-18C  19003   1903-24057-18A  1904-24057-18B  1905-24057-18C  19089   19119   19264   19284   19320   19332   19335   19425   19708   19724   19928   19953   19980   19981   20015   20033   20088   20098   20113   20160   20161   20181   2020-24195-18A  2021-24195-18B  2022-24195-18C  20320   20337   20393   20395   2059-23912-18   2063-23906-18   20651   2066-23915-18   20676   20680   20828   2086-24267-18   20874   20899   20913

This one I compressed to a BGZF file myself:

$ bgzip file

And it looks like this:

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  11111_22222_19
1       877831  rs6672356       T       C       751.77  PASS    AC=2;AF=1;AN=2;DB;DP=27;ExcessHet=3.0103;FS=0;MLEAC=2;MLEAF=1;MQ=60;QD=27.84;SOR=1.781  GT:AD:DP:GQ:PL  1/1:0,27:27:81:780,81,0
1       949608  rs1921  G       A       1765.77 PASS    AC=1;AF=0.5;AN=2;BaseQRankSum=-2.058;ClippingRankSum=0;DB;DP=146;ExcessHet=3.0103;FS=6.037;MLEAC=1;MLEAF=0.5;MQ=60;MQRankSum=0;QD=12.09;ReadPosRankSum=-0.567;SOR=1.003 GT:AD:DP:GQ:PL  0/1:77,69:146:99:1794,0,2181

I also have a file that has the same variants as the big file but not all the information (for example, GT and AD are missing), so I tried to get those from the big file with bedtools intersect -u. That way I would have just the specific few variants with all their information. The problem is that the new file also shows only random numbers, and it is reported only as VCF version 4.2 variant calling data. I also can't decompress it, because bgzip -d file gives the error: unknown suffix -- ignored. This is the output of bedtools intersect:

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  10022-20416-17  10024-34469-18A 10025-34469-18B 10034-31625-18A 10035-31625-18B 10036-31625-18C 10042-29083-18  10044-34485-18A 10045-34485-18B 10046-34485-18C 10069-33802-18  10070-20895-17  10072-20901-17  10074-20904-17  10080-20908-17  10109-34224-18  1011-22957-18   10118   10123-20960-17  10156-20985-17A 10157-20985-17B 10158-20985-17C 10177-20994-17  10197-34251-18  10226-20274-17  10234-34382-18  10239-21076-17  10241-21077-17  10241-34234-18  10242-33545-18  10246-34072-18  10252-20459-17A
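A note on the "unknown suffix -- ignored" message: it appears to be about the file name lacking a .gz-style suffix, and bedtools intersect writes plain text, so this output is most likely an uncompressed VCF that needs no decompression at all. That would also fit it being reported without "BGZF-compressed". A minimal sketch for checking this, assuming only standard tools; `is_gzip` is a hypothetical helper name, not part of htslib or bedtools:

```shell
# is_gzip FILE: succeed if FILE starts with the gzip magic bytes 1f 8b.
# BGZF files are gzip-compatible, so they start with the same two bytes.
# (is_gzip is a hypothetical helper name used only for this sketch.)
is_gzip() {
    magic=$(od -An -tx1 -N 2 "$1" | tr -d ' \n')
    [ "$magic" = "1f8b" ]
}

# Usage (the file name is a placeholder):
# if is_gzip intersect_output.vcf; then echo "compressed"; else echo "plain text"; fi
```

If the check reports plain text, the file can be read directly with less or passed to bgzip for compression.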

So how can I get those specific variants with all their information without decompressing the big file, which I can't do? And is it even normal for a BGZF file to look like that?
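One possible approach, sketched under assumptions: stream the compressed file through a filter instead of writing a decompressed copy to disk. The helper name `extract_variants` and the file names are hypothetical. The first argument is any uncompressed tab-separated file whose first two columns are CHROM and POS (the smaller VCF would do), and matching is by exact CHROM and POS only, not by alleles. With bcftools available, `bcftools view -T` with such a list is probably the more usual route.

```shell
# extract_variants LIST BIGFILE
#   LIST:    uncompressed, tab-separated, CHROM in column 1 and POS in column 2
#   BIGFILE: gzip- or BGZF-compressed VCF
# Prints the header lines plus every record whose CHROM/POS pair is in LIST.
# (extract_variants is a hypothetical helper name used only for this sketch.)
extract_variants() {
    gzip -dc "$2" | awk -F'\t' '
        NR == FNR { want[$1 FS $2] = 1; next }  # first input: remember the pairs
        /^#/ || (($1 FS $2) in want)            # second input: header and matches
    ' "$1" -
}

# Usage (file names are placeholders):
# extract_variants small.vcf big.vcf.gz > subset.vcf
```

Nothing is stored on disk except the matching records, so the size of the decompressed big file does not matter. LIST must not be empty, otherwise awk cannot tell the two inputs apart and prints nothing.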

EDIT: The file had just over 60,000 samples on the header line, and the variants do not have GT information for all samples. That explains the many different number combinations at the beginning (they are sample names) and also the confusing number of "./.:.:.:.:." entries in the rows. With this command the output looks a lot better, with each sample printed on the same line as its variant rather than only in the header:

bcftools query -f '[%CHROM %POS %ID %SAMPLE %GT:%DP:%GQ\nCSQ=%CSQ\n\n]' FILE -i 'GT!="./."' | less
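A quick way to confirm that the "random numbers" are sample names is to count the sample columns and look at only the first few columns of each record. This is a rough sketch that assumes a VCF with a FORMAT column; `count_samples` and `peek_records` are hypothetical helper names. With bcftools, `bcftools query -l FILE | wc -l` gives the sample count as well.

```shell
# count_samples FILE: number of sample columns on the #CHROM header line,
# i.e. everything after the nine fixed columns CHROM through FORMAT.
# (Assumes the VCF has a FORMAT column; helper names are hypothetical.)
count_samples() {
    gzip -dc "$1" | awk -F'\t' '/^#CHROM/ { print NF - 9; exit }'
}

# peek_records FILE [N]: first N records (default 5), cut down to the
# nine fixed columns plus the first three samples so the lines stay readable.
peek_records() {
    gzip -dc "$1" | grep -v '^#' | cut -f 1-12 | head -n "${2:-5}"
}

# Usage (the file name is a placeholder):
# count_samples big.vcf.gz
# peek_records big.vcf.gz 10
```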
BGZF • 1.6k views
3.4 years ago

This sounds like the file was corrupted. Run

gunzip -c filename.gz > out

and see how much of the file can be decompressed.

BGZF files written by bgzip are gzip-compatible, so if the problem occurs later in the file you may be able to salvage some of its content.
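Before salvaging anything, it may help to test whether the file is actually damaged. The sketch below uses two hypothetical helpers: `gzip_ok` runs a full integrity test, and `has_bgzf_eof` checks for the 28-byte empty block that bgzip normally writes at the end of a BGZF file. Note also that taking the first 4000 "lines" of the compressed file cuts it in the middle of a compressed block, which by itself is enough to produce "unexpected end of file"; the text has to be decompressed first and truncated afterwards.

```shell
# gzip_ok FILE: succeed if the whole file decompresses cleanly.
# gzip -t reads every block and verifies the checksums without writing output.
# (gzip_ok and has_bgzf_eof are hypothetical helper names for this sketch.)
gzip_ok() {
    gzip -t "$1" 2>/dev/null
}

# has_bgzf_eof FILE: succeed if FILE ends with the 28-byte empty BGZF block
# that serves as the end-of-file marker; a file cut off early lacks it.
has_bgzf_eof() {
    eof=1f8b08040000000000ff0600424302001b0003000000000000000000
    [ "$(tail -c 28 "$1" | od -An -v -tx1 | tr -d ' \n')" = "$eof" ]
}

# To look at the first lines, cut the decompressed text, not the compressed bytes:
# gzip -dc filename.gz | head -n 4000
```

If both checks pass, the file is probably intact and the odd appearance has another cause. A plain gzip file (not BGZF) passes the first check but not the second.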


The gunzip -c command runs, but the file it produces also contains just those random numbers.


Sounds like the file is corrupted and the data is invalid.


I was just thinking that I can still check the variants, for example with bcftools query -f '%CHROM %POS\n'. I thought it would be impossible to run any commands on a file if it's corrupted?


Most likely because it is only partially corrupted. Corruption can mean a lot of things, and some information may be salvageable.
