I tried to use vt (http://genome.sph.umich.edu/wiki/Vt#Installation) to normalize my GRCh37 dbSNP VCF, which I had annotated beforehand using bcftools.
My question is how to avoid the error below. I know that there are options (-n and -m) that relax this consistency check so that the error no longer appears. However, when I relaxed the check, vt reported differences between the chromosome Y reference alleles (REF) in my VCF and the chromosome Y sequence in the GRCh37 FASTA, apparently across the entire chromosome. The GRCh37 FASTA has an additional contig section in its header.
Any suggestions?
The error message:
[variant_manip.cpp:96 is_not_ref_consistent] reference bases not consistent: Y:10019-10019 T(REF) vs N(FASTA)
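The message says that the VCF gives T as REF at Y:10019 while the FASTA holds N (an unknown or masked base) at the same coordinate. A minimal sketch for looking up what the FASTA holds at a given position is shown below. It assumes a plain-text, uncompressed FASTA whose contig name is the first token of each header line. The function name `fetch_reference` and the tiny demo sequence are hypothetical, made up for illustration only.

```python
import io


def fetch_reference(fasta_handle, contig, start, end):
    """Return bases start..end (1-based, inclusive) of `contig`.

    Streams the FASTA line by line, so no index is needed.
    Returns an empty string if the contig is not found.
    """
    bases = []
    in_contig = False
    consumed = 0  # bases of the target contig read so far
    for line in fasta_handle:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            if in_contig:
                break  # reached the next contig, stop reading
            fields = line[1:].split()
            # the contig name is the first whitespace-delimited token
            in_contig = bool(fields) and fields[0] == contig
            continue
        if not in_contig or not line:
            continue
        line_start = consumed + 1
        line_end = consumed + len(line)
        if line_end >= start:
            lo = max(start, line_start) - line_start
            hi = min(end, line_end) - line_start + 1
            bases.append(line[lo:hi])
        consumed = line_end
        if consumed >= end:
            break
    return "".join(bases)


# Hypothetical toy FASTA: contig Y starts with a run of N, like a masked region.
demo_fasta = io.StringIO(">X demo\nACGT\nACGT\n>Y demo\nNNNN\nNNTA\n")
print(fetch_reference(demo_fasta, "Y", 6, 7))  # prints "NT"
```

On the real data this would be called with an open handle on /home/margaret/Data/dbSNP/grch37/human_g1k_v37.fasta and the arguments `"Y", 10019, 10019`. If samtools is installed, `samtools faidx` with the region `Y:10019-10019` should give the same base much faster, since it uses an index instead of scanning the file.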
Here is my command for the VT program:
margaret@SII-T7500-01:~/Programs/vt$ ./vt sort -m full /home/margaret/Data/dbSNP/grch37/dbSNP_grch37.vcf | ./vt normalize -r /home/margaret/Data/dbSNP/grch37/human_g1k_v37.fasta - | ./vt uniq -o dbsnp_grch37_normalized.vcf - &
Here is my output:
margaret@SII-T7500-01:~/Programs/vt$ normalize v0.5
options: input VCF file -
[o] output VCF file -
[w] sorting window size 10000
[m] no fail on masked reference inconsistency false
[n] no fail on reference inconsistency false
[q] quiet false
[d] debug false
[r] reference FASTA file /home/margaret/Data/dbSNP/grch37/human_g1k_v37.fasta
uniq v0.57
options: input VCF file -
[o] output VCF file dbsnp_grch37_normalized.vcf
[variant_manip.cpp:96 is_not_ref_consistent] reference bases not consistent: Y:10019-10019 T(REF) vs N(FASTA)
[normalize.cpp:209 normalize] Normalization not performed due to inconsistent reference sequences. (use -n or -m option to relax this)
[W::vcf_parse] INFO 'db' is not defined in the header, assuming Type=String
stats: Total number of observed variants 149043129
Total number of unique variants 147668200
Time elapsed: 8m 23s
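To see whether the inconsistencies really cover all of chromosome Y or only positions where the FASTA is N, the REF mismatches can be tallied per contig. The sketch below is one rough way to do that. It assumes an uncompressed, tab-delimited VCF and a FASTA small enough to hold in memory (a full human genome needs several GB this way), and it assumes that "masked" in the vt option text means N in the FASTA. The function names and the demo records (rs1 to rs4) are hypothetical.

```python
import io
from collections import Counter


def read_fasta(handle):
    """Load a FASTA stream into {contig_name: sequence} (all in memory)."""
    sequences, name, chunks = {}, None, []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if name is not None:
                sequences[name] = "".join(chunks)
            fields = line[1:].split()
            name = fields[0] if fields else ""
            chunks = []
        elif line and name is not None:
            chunks.append(line)
    if name is not None:
        sequences[name] = "".join(chunks)
    return sequences


def tally_ref_mismatches(vcf_handle, sequences):
    """Count, per contig, records whose REF disagrees with the FASTA.

    Returns (masked, other): mismatches where the FASTA span contains N,
    and all remaining mismatches (including contigs absent from the FASTA).
    """
    masked, other = Counter(), Counter()
    for line in vcf_handle:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        chrom, pos, _id, ref = line.rstrip("\r\n").split("\t")[:4]
        start = int(pos) - 1  # VCF positions are 1-based
        fasta_bases = sequences.get(chrom, "")[start:start + len(ref)].upper()
        if fasta_bases == ref.upper():
            continue
        if "N" in fasta_bases:
            masked[chrom] += 1
        else:
            other[chrom] += 1
    return masked, other


# Hypothetical toy inputs, not real dbSNP records.
demo_fasta = io.StringIO(">1\nACGTACGT\n>Y\nNNNNNNTA\n")
demo_vcf = io.StringIO(
    "##fileformat=VCFv4.1\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "1\t2\trs1\tC\tT\t.\t.\t.\n"   # matches the FASTA
    "1\t5\trs2\tG\tA\t.\t.\t.\n"   # FASTA has A: a real mismatch
    "Y\t3\trs3\tT\tC\t.\t.\t.\n"   # FASTA has N: masked mismatch
    "Y\t7\trs4\tT\tG\t.\t.\t.\n"   # matches the FASTA
)
masked, other = tally_ref_mismatches(demo_vcf, read_fasta(demo_fasta))
print(dict(masked), dict(other))  # prints {'Y': 1} {'1': 1}
```

If nearly all mismatches land in the `masked` counter, that would suggest the FASTA is N at those positions rather than carrying a different base, which is the case the -m option in the output above appears to target.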