I have a big file that htsfile identifies as VCF version 4.2 BGZF-compressed variant calling data. The problem is that the file looks really weird: it seems to contain only random numbers. I didn't compress this file myself and I don't know who did, so I don't know exactly how it was made. If I take, for example, the first 4000 lines of the big file and try to decompress or view them, I get the error: unexpected end of file. When I compress some other file to BGZF format, I can still see separate columns for CHROM POS ID... This is what the weird-looking big file looks like:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 10022-20416-17 10024-34469-18A 10025-34469-18B 10034-31625-18A 10035-31625-18B 10036-31625-18C 10042-29083-18 10044-34485-18A 10045-34485-18B 10046-34485-18C 10069-33802-18 10070-20895-17 10072-20901-17 10074-20904-17 10080-20908-17 10109-34224-18 1011-22957-18 10118 10123-20960-17 10156-20985-17A 10157-20985-17B 10158-20985-17C 10177-20994-17 10197-34251-18 10226-20274-17 10234-34382-18 10239-21076-17 10241-21077-17 10241-34234-18 10242-33545-18 10246-34072-18 10252-20459-17A 10253-20459-17B 10254-20459-17C 10260-21090-17A 10261-21090-17B 10262-21090-17C 10264-34749-18A 10265-34749-18B 10266-34749-18C
And at some point it looks like this:
1884-24019-18C 19003 1903-24057-18A 1904-24057-18B 1905-24057-18C 19089 19119 19264 19284 19320 19332 19335 19425 19708 19724 19928 19953 19980 19981 20015 20033 20088 20098 20113 20160 20161 20181 2020-24195-18A 2021-24195-18B 2022-24195-18C 20320 20337 20393 20395 2059-23912-18 2063-23906-18 20651 2066-23915-18 20676 20680 20828 2086-24267-18 20874 20899 20913
This is a file I compressed to BGZF myself:
$ bgzip file
And it looks like this:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 11111_22222_19
1 877831 rs6672356 T C 751.77 PASS AC=2;AF=1;AN=2;DB;DP=27;ExcessHet=3.0103;FS=0;MLEAC=2;MLEAF=1;MQ=60;QD=27.84;SOR=1.781 GT:AD:DP:GQ:PL 1/1:0,27:27:81:780,81,0
1 949608 rs1921 G A 1765.77 PASS AC=1;AF=0.5;AN=2;BaseQRankSum=-2.058;ClippingRankSum=0;DB;DP=146;ExcessHet=3.0103;FS=6.037;MLEAC=1;MLEAF=0.5;MQ=60;MQRankSum=0;QD=12.09;ReadPosRankSum=-0.567;SOR=1.003 GT:AD:DP:GQ:PL 0/1:77,69:146:99:1794,0,2181
I also have a file that has the same variants as the big file but without all the information (for example, GT and AD are missing), so I tried to get those from the big file using bedtools intersect -u. That way I would have just those few specific variants with all the information. The problem is that the new file also shows only random numbers, although it is now reported as just VCF version 4.2 variant calling data. And I can't decompress it, because bgzip -d file gives the error: unknown suffix -- ignored. This is the output of bedtools intersect:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT 10022-20416-17 10024-34469-18A 10025-34469-18B 10034-31625-18A 10035-31625-18B 10036-31625-18C 10042-29083-18 10044-34485-18A 10045-34485-18B 10046-34485-18C 10069-33802-18 10070-20895-17 10072-20901-17 10074-20904-17 10080-20908-17 10109-34224-18 1011-22957-18 10118 10123-20960-17 10156-20985-17A 10157-20985-17B 10158-20985-17C 10177-20994-17 10197-34251-18 10226-20274-17 10234-34382-18 10239-21076-17 10241-21077-17 10241-34234-18 10242-33545-18 10246-34072-18 10252-20459-17A
So how can I get those specific variants with all the information without decompressing the big file, which I can't do? And is it even normal for a BGZF file to look like that?
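On the unexpected end of file error: a BGZF file is a series of concatenated gzip blocks, so cutting the compressed file at an arbitrary line or byte will usually split a block, and decompressing that fragment is likely to fail. One way around it is to decompress as a stream and stop after the lines you need. A minimal Python sketch, assuming the file is well-formed BGZF and therefore readable by the standard gzip module; head_lines is a hypothetical helper name, not part of any tool mentioned here:

```python
import gzip
from itertools import islice


def head_lines(path, n):
    """Return the first n decompressed text lines of a gzip/BGZF file.

    Hypothetical helper for illustration only.
    """
    # gzip.open reads concatenated gzip members transparently, and islice
    # stops after n lines, so the rest of the file is never decompressed.
    with gzip.open(path, "rt") as handle:
        return [line.rstrip("\n") for line in islice(handle, n)]
```

For example, head_lines("big.vcf.gz", 4000) would return the first 4000 decompressed lines without writing the whole file to disk (big.vcf.gz is a placeholder path).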
EDIT: The file has just over 60 000 samples on the header line, and the variants do not have GT information for all samples. That explains the many different number combinations at the beginning, and also the confusing number of entries like "./.:.:.:.:." in the rows. With this command I can make the file look a lot better, with the samples on the same line as the variants rather than in the header:
bcftools query -f '[%CHROM %POS %ID %SAMPLE %GT:%DP:%GQ\nCSQ=%CSQ\n\n]' FILE -i 'GT!="./."' | less
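To show why the raw rows look so confusing, here is a rough Python sketch of the same idea: pair each sample name from the #CHROM header line with its entry in a variant row and skip the samples whose genotype is missing. It is only an illustration and not how bcftools works internally; called_samples is a hypothetical name, and it assumes tab-separated lines where FORMAT contains GT.

```python
import re


def called_samples(header_line, record_line):
    """Return (sample, GT) pairs for samples with a non-missing genotype.

    Hypothetical helper; assumes tab-separated VCF lines with GT in FORMAT.
    """
    # Sample names start at the 10th column of the #CHROM header line.
    samples = header_line.rstrip("\n").split("\t")[9:]
    fields = record_line.rstrip("\n").split("\t")
    # Column 9 (FORMAT) names the colon-separated keys of each sample entry.
    gt_index = fields[8].split(":").index("GT")
    calls = []
    for name, entry in zip(samples, fields[9:]):
        gt = entry.split(":")[gt_index]
        # Treat the call as missing when every allele is ".", e.g. "./.".
        if all(allele == "." for allele in re.split(r"[/|]", gt)):
            continue
        calls.append((name, gt))
    return calls
```

With 60 000 samples most entries in a row can be "./.:.:.:.:.", so keeping only the called samples leaves a much shorter list per variant.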
It runs the gunzip -c command, but the file it produces also has just those random numbers.
sounds like the file is corrupted and the data is invalid
Just thinking that I can still check the variants with, for example, bcftools query -f '%CHROM %POS\n'. I thought it would be impossible to run any commands on a file if it's corrupted?
Most likely because it is only partially corrupted. "Corrupted" can mean a lot of things, and some information may be salvageable.