Tabix '[E::get_intv] Failed to parse TBX_VCF' error
2.5 years ago
bdolin ▴ 100

Greetings, I've seen related threads, but none that seem to point out what I'm overlooking here.

I am trying to bgzip and then tabix index a small VCF file:

##fileformat=VCFv4.1
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  GS000016660-ASM
1   1413232 .   CGC TGT .   .   NS=1;AN=2   GT:PS:FT:GQ:HQ:EHQ:CGA_CEHQ:GL:CGA_CEGL:DP:AD:CGA_RDP   1|0:1413210:PASS:51:166,51:163,47:29,20:-166,0,-51:-29,0,-20:20:10,10:10
1   1469598 .   CG  GC  .   .   NS=1;AN=2   GT:PS:FT:GQ:HQ:EHQ:CGA_CEHQ:GL:CGA_CEGL:DP:AD:CGA_RDP   1/1:.:PASS:68:68,615:68,615:35,35:-615,-68,0:-35,-35,0:38:38,38:0
1   1875858 .   CA  GG  .   .   NS=1;AN=2   GT:PS:FT:GQ:HQ:EHQ:CGA_CEHQ:GL:CGA_CEGL:DP:AD:CGA_RDP   1/1:.:PASS:35:390,35:390,35:33,33:-390,-35,0:-33,-33,0:23:22,22:1
1   2116593 .   CG  GC  .   .   NS=1;AN=2   GT:PS:FT:GQ:HQ:EHQ:CGA_CEHQ:GL:CGA_CEGL:DP:AD:CGA_RDP   1/0:.:VQLOW:21:21,21:4,4:0,17:-21,0,-21:0,0,-17:14:2,12:12

I run bgzip

bgzip myVcf.vcf

and then tabix

tabix myVcf.vcf.gz

and get errors that I cannot figure out:

[E::get_intv] Failed to parse TBX_VCF, was wrong -p [type] used?
The offending line was: "#"
[E::get_intv] Failed to parse TBX_VCF, was wrong -p [type] used?
The offending line was: ""
[E::get_intv] Failed to parse TBX_VCF, was wrong -p [type] used?
The offending line was: ""

Any ideas would be greatly appreciated.


TBX_VCF in the error message indicates that the file has already been identified (via its filename) as VCF, so adding -p vcf will make no difference. Following your steps with the small VCF file as shown works for me, so you would need to attach the actual file for us to help work out where tabix is finding the blank lines and single-# lines it reports.


Thank you John. File is here: https://drive.google.com/file/d/1PqFrl-VpPFvMCQZhEv8baCxkIR2zdK3O/view?usp=sharing

I ran the latest version of snpEff on a Windows machine. I'm running tabix on Ubuntu. Appreciate your help.

2.5 years ago

Congratulations, you have demonstrated a failure mode that I have not seen in the wild before.

$ od -c myVcf.vcf
0000000  377 376   #  \0   #  \0   f  \0   i  \0   l  \0   e  \0   f  \0
0000020    o  \0   r  \0   m  \0   a  \0   t  \0   =  \0   V  \0   C  \0
0000040    F  \0   v  \0   4  \0   .  \0   1  \0  \n  \0   #  \0   C  \0
0000060    H  \0   R  \0   O  \0   M  \0  \t  \0   P  \0   O  \0   S  \0
[…]

 

Your file is UTF-16-encoded.

$ file myVcf.vcf 
myVcf.vcf: Variant Call Format (VCF) version 4.1, Unicode text, UTF-16, little-endian text
$ htsfile myVcf.vcf 
myVcf.vcf:  unknown data

You will need to convert it to ASCII or UTF-8 for it to be a usable VCF file. Windows, eh?!
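
For anyone who lands here with the same problem, here is a minimal sketch of the check and the conversion in Python. The function names are made up for illustration, and it assumes the file carries a byte-order mark, as the od output above shows (377 376 is 0xFF 0xFE, the little-endian BOM). On Ubuntu, iconv can do the same conversion from the command line.

```python
import codecs
from pathlib import Path

# Byte-order marks that identify UTF-16 text.
UTF16_BOMS = (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)


def is_utf16(path):
    """Return True if the file starts with a UTF-16 byte-order mark."""
    with open(path, "rb") as handle:
        head = handle.read(4)
    # The UTF-32 little-endian BOM (FF FE 00 00) also begins with FF FE,
    # so rule it out before testing for UTF-16.
    if head.startswith(codecs.BOM_UTF32_LE):
        return False
    return head.startswith(UTF16_BOMS)


def convert_utf16_to_utf8(src, dst):
    """Re-encode a UTF-16 text file as UTF-8, dropping the BOM."""
    # The "utf-16" codec uses the BOM to pick the byte order and strips it.
    text = Path(src).read_bytes().decode("utf-16")
    # Work on bytes rather than text mode so line endings pass through as-is.
    Path(dst).write_bytes(text.encode("utf-8"))
```

Calling convert_utf16_to_utf8("myVcf.vcf", "myVcf.utf8.vcf") (the second filename is just an example) gives a file that can then go through the bgzip and tabix steps from the question.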


No doubt, this site has once again justified the name 'biostars'. Thank you!


Pull request to improve the error message so this encoding issue is easily identified: HTSlib PR #1487.


Version 4.3 of the VCF specifications added clarification around VCF newlines and file encoding:

The character encoding of VCF files is UTF-8.

Line separators must be CR+LF or LF

As the header specifies version 4.1, the file would technically be a valid version 4.1 VCF, but that doesn't help, since the majority of VCF tools support only UTF-8.
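
The two rules quoted above are easy to test before indexing. The following is a rough sketch only, not a VCF validator: check_vcf_text is a hypothetical helper that takes the raw bytes of an uncompressed VCF, and the NUL-byte test is my own heuristic for catching UTF-16 without a BOM, not something taken from the spec.

```python
import re


def check_vcf_text(data):
    """Return a list of encoding/newline problems found in raw VCF bytes."""
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError as err:
        # A UTF-16 BOM (FF FE or FE FF) fails here at byte offset 0.
        return [f"not valid UTF-8 (byte offset {err.start})"]

    problems = []
    # Heuristic, not from the spec: NUL bytes decode as valid UTF-8 but
    # suggest ASCII text stored as UTF-16 without a byte-order mark.
    if "\x00" in text:
        problems.append("contains NUL bytes (possibly UTF-16 without a BOM)")
    # Line separators must be CR+LF or LF, so a CR not followed by LF is flagged.
    if re.search(r"\r(?!\n)", text):
        problems.append("contains CR not followed by LF")
    return problems
```

An empty list means neither rule was broken; it says nothing about whether the records themselves are well-formed.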


The VCF 4.1 spec is silent on the matter, simply saying “VCF is a text file format (most likely stored in a compressed manner)”. IMHO it is not accurate to say that this makes a UTF-16-encoded text file even “technically valid” according to the VCF 4.1 spec. (At best, it is unspecified.)

To make that leap, you would need to ask the editors what their intentions were. Conversations on the mailing list in 2012 show that the intention of the editors at the time was “ASCII only”. Encodings were not mentioned explicitly, but to my mind the context and the absence of any consideration of encodings imply an 8-bit encoding.

This is backed up by the practical considerations, as we have seen in this biostars question, i.e., tools actually don't read 'em.
