I have some older VCF files that don't have the contig length set in the VCF header. This means that Picard and some other tools that are very strict with the VCF spec won't accept them.
The contig entries in the header should be
##contig=<ID=1,length=195471971>
##contig=<ID=2,length=182113224>
but they are
##contig=<ID=1>
##contig=<ID=2>
I know that I can fix this manually with the following steps:
- unzip the file
- extract the header
- look up the contig lengths in a fasta.fa.fai file
- add the lengths to the contig records in the header with vim
- re-header with bcftools
- bgzip and tabix the re-headered VCF file
In my hands this works, but if you make a slight copy-paste error in vim, you will have spent a lot of time re-headering and bgzipping a large VCF file for nothing.
Therefore I would like a more robust, automatic solution: I give it the VCF file and the reference genome file, the header is fixed automatically, and a new bgzipped VCF file is written out.
Is there a tool that can add the contig lengths to the VCF header and write out a new bgzipped VCF file?
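To make the error-prone vim step concrete, here is a minimal Python sketch of what an automated header fix could look like. It is only an illustration, not an existing tool: the function names `read_fai_lengths` and `add_contig_lengths` are hypothetical, and it assumes the .fai index has the contig name in column 1 and its length in column 2, and that contig lines start with `##contig=<ID=`.

```python
import re

# Matches the start of a contig header line and captures the contig ID.
CONTIG_ID = re.compile(r"^##contig=<ID=([^,>]+)")


def read_fai_lengths(fai_path):
    """Return {contig name: length} from a .fai index (columns 1 and 2)."""
    lengths = {}
    with open(fai_path) as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            if len(fields) >= 2:
                lengths[fields[0]] = int(fields[1])
    return lengths


def add_contig_lengths(header_lines, lengths):
    """Return header lines with length=... added to contig lines that lack it.

    Hypothetical helper: raises KeyError if a contig is missing from the
    reference index, so a mismatch is caught before any large file is rewritten.
    """
    fixed = []
    for line in header_lines:
        text = line.rstrip("\n")
        match = CONTIG_ID.match(text)
        if match and "length=" not in text:
            contig = match.group(1)
            if contig not in lengths:
                raise KeyError(f"contig {contig!r} not found in the .fai index")
            # Insert the length directly after the ID field.
            text = text[:match.end()] + f",length={lengths[contig]}" + text[match.end():]
        fixed.append(text)
    return fixed
```

The header lines could come from `bcftools view -h`, and the corrected lines could be written to a text file and applied with `bcftools reheader -h`, so the new header can be inspected before the large VCF is rewritten and indexed.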
Great solution, particularly as there's an opportunity to double-check the lengths before writing them to the VCF :)