Here's a massively simplified VCF file with one line:
##fileformat=VCFv4.1
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT M_CJ-R878H_AML1-R878H_AML1
1 3329234 . G T 36.8 . TIER=1 GT 1/1
If I run VEP on it like so, it returns a warning, apparently while trying to parse the FORMAT header line (line 2 of the file):
$ perl /path/to/ensembl-tools-release-86/scripts/variant_effect_predictor/variant_effect_predictor.pl -i test.vcf -o out.vcf --offline --cache_version 67 --species mus_musculus --vcf --symbol --format vcf --dir_cache /path/to/.vep --dir_plugins /path/to/VEP_plugins-release-86
2017-05-04 12:46:31 - Read existing cache info
2017-05-04 12:46:31 - Starting...
WARNING: Invalid input formatting on line 2
2017-05-04 12:46:31 - Read 1 variants into buffer
2017-05-04 12:46:31 - Reading transcript data from cache and/or database
[========================================================================================================================] [ 100% ]
2017-05-04 12:46:31 - Retrieved 8 transcripts (0 mem, 8 cached, 0 DB, 0 duplicates)
2017-05-04 12:46:31 - Analyzing chromosome 1
2017-05-04 12:46:31 - Analyzing variants
[========================================================================================================================] [ 100% ]
2017-05-04 12:46:31 - Calculating consequences
2017-05-04 12:46:31 - Processed 1 total variants (1 vars/sec, 1 vars/sec total)
2017-05-04 12:46:31 - Wrote stats summary to out.vcf_summary.html
2017-05-04 12:46:31 - See out.vcf_warnings.txt for details of 1 warnings
2017-05-04 12:46:31 - Finished!
To support my suspicion that it's not handling the header correctly: if I run the same command on this VCF without the --format vcf flag, VEP is unable to detect that the input is a VCF.
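As a sanity check that does not depend on VEP, here is a minimal sketch of how the file layout could be verified. It assumes only the usual VCF convention that the #CHROM row and the data rows are tab-separated with at least 8 fixed columns; the function name check_vcf_layout is mine and is not part of VEP or any library.

```python
def check_vcf_layout(lines):
    """Return a list of (line_number, message) problems found in VCF text lines."""
    problems = []
    header_cols = None
    for number, line in enumerate(lines, start=1):
        line = line.rstrip("\r\n")
        if not line or line.startswith("##"):
            continue  # meta-information lines are not column-based
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            header_cols = len(fields)
            if header_cols < 8:
                problems.append(
                    (number, "#CHROM row has %d tab-separated columns, "
                             "expected at least 8" % header_cols))
            continue
        if line.startswith("#"):
            continue  # any other comment-style line is ignored here
        if len(fields) < 8:
            problems.append(
                (number, "data row has %d tab-separated columns, "
                         "expected at least 8" % len(fields)))
        elif header_cols is not None and len(fields) != header_cols:
            problems.append(
                (number, "data row has %d columns but the #CHROM row has %d"
                         % (len(fields), header_cols)))
    return problems
```

Passing the lines of test.vcf to check_vcf_layout should give an empty list if the layout is fine; any entries would point at rows where spaces have replaced tabs or where the column count disagrees with the #CHROM row.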
It does return the annotated VCF lines correctly when told that the input is a VCF, but it doesn't pass through the existing header lines, and it doesn't add the CSQ header line that contains the key for parsing the information VEP adds.
Has anyone encountered this before? Any suggestions on how to make VEP do the right thing here?
Edit: adding the output, which is sane but lacks the expected headers:
##fileformat=VCF
1 3329234 . G T 36.8 . GT;CSQ=T|intron_variant|MODIFIER||ENSMUSG00000051951|Transcript|ENSMUST00000070533|protein_coding||2/2||||||||||-1||| 1/1
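To illustrate why the missing CSQ header matters, here is a rough sketch of how the annotation would normally be decoded. The header line in it is an assumption: it is the field list I would expect from this release with --symbol, not something present in my output, and the helper names csq_field_names and parse_csq are made up for this example.

```python
import re

# ASSUMED header line: my output has no CSQ header, so this field list is a
# guess at what VEP would normally have written for this command.
CSQ_HEADER = (
    '##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence annotations '
    'from Ensembl VEP. Format: Allele|Consequence|IMPACT|SYMBOL|Gene|'
    'Feature_type|Feature|BIOTYPE|EXON|INTRON|HGVSc|HGVSp|cDNA_position|'
    'CDS_position|Protein_position|Amino_acids|Codons|Existing_variation|'
    'DISTANCE|STRAND|FLAGS|SYMBOL_SOURCE|HGNC_ID">'
)

# INFO column copied from the output line above.
INFO = ("GT;CSQ=T|intron_variant|MODIFIER||ENSMUSG00000051951|Transcript|"
        "ENSMUST00000070533|protein_coding||2/2||||||||||-1|||")


def csq_field_names(header_line):
    """Pull the pipe-separated field names out of a CSQ header description."""
    match = re.search(r'Format: ([^"]+)"', header_line)
    if match is None:
        raise ValueError("no 'Format: ...' key found in header line")
    return match.group(1).split("|")


def parse_csq(info, field_names):
    """Return one dict per comma-separated CSQ entry in an INFO string."""
    for item in info.split(";"):
        if item.startswith("CSQ="):
            return [dict(zip(field_names, entry.split("|")))
                    for entry in item[len("CSQ="):].split(",")]
    return []  # no CSQ annotation present


records = parse_csq(INFO, csq_field_names(CSQ_HEADER))
print(records[0]["Consequence"], records[0]["Feature"])
```

Without the header line there is nothing to take the field names from, so a downstream parser has to hard-code the order of the pipe-separated values, which is exactly what I would like to avoid.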
Sadly, that's not the issue. I'm editing the post above to make the VCF even simpler and make the FORMAT lines match up 100% with the fields - the same warning and header recognition issue persists.