I have downloaded variation data from the PopFly database (http://popfly.uab.cat). However, the VCF headers look malformed. If I try to index the VCF files I get segmentation fault: 11, and if I try to parse them with bcftools I get various header errors. I think the easiest solution would be to rewrite the headers entirely. Is there a tool that does this?
As an aside, is there any benefit to VCF files over BED files? I know many annotation tools take VCF as input (e.g. snpEff), but I can strip away the header, convert to bigBed, and still annotate with ANNOVAR, for instance.
Here is the header and the first few lines:
##fileformat=VCFv4.1
##contig=<ID=1,length=23011544>
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT RAL-149_Chr2L RAL-335_Chr2L RAL-357_Chr2L RAL-399_Chr2L RAL-42_Chr2L RAL-461_Chr2L RAL-491_Chr2L RAL-703_Chr2L RAL-721_Chr2L RAL-783_Chr2L USI02_Chr2L USI03_Chr2L USI16_Chr2L USI17_Chr2L USI24_Chr2L USI31_Chr2L USI33_Chr2L USI34_Chr2L USI35_Chr2L USI38_Chr2L USW_35_Chr2L USW_37_Chr2L USW_40_Chr2L USW_49_Chr2L USW_50_Chr2L USW_54_Chr2L USW_59_Chr2L USW_66_Chr2L USW_69_Chr2L USW_74_Chr2L
2L 5039 . C N,T . . . GT 0 1 1 2 1 0 1 0 1 1
2L 5076 . G N,T . . . GT 0 1 1 1 1 0 1 0 1 1
2L 5092 . C N,T . . . GT 0 1 2 2 2 0 1 0 1 1
2L 5095 . T N,A . . . GT 0 1 2 2 2 0 1 0 1 1
2L 5317 . G N,A . . . GT 0 0 0 0 0 0 1 0 1 0
2L 5372 . T N,A . . . GT 0 1 0 0 2 0 1 0 1 0
Can you post an example of the header that is giving you problems? I work with VCF files daily and haven't had any problems.
Hi, I have added an example header. Thanks.
The header seems fine. What problem do you get besides the segmentation fault? Do you get any more information?
When I try:
I get:
Yet when I compress the file with bgzip and then try to index it with tabix, I get segmentation fault 11.
It seems that your file doesn't have all the contigs in the header: the header declares only a contig with ID=1, while the records are on 2L. How did you generate this VCF file? Can you post the command used to generate it?
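One way to confirm this kind of mismatch is to compare the contig names declared in the header with the names used in the CHROM column. The following is a minimal Python sketch, not a replacement for a proper validator. The function name `contig_mismatch` is hypothetical, and the example lines are a cut-down copy of the posted header with a single sample column.

```python
import re

def contig_mismatch(vcf_lines):
    """Return (declared, used) sets of contig names for an iterable of VCF lines."""
    declared, used = set(), set()
    for line in vcf_lines:
        line = line.rstrip("\r\n")
        if line.startswith("##contig="):
            # Pull the ID value out of a line such as ##contig=<ID=1,length=...>
            match = re.search(r"ID=([^,>]+)", line)
            if match:
                declared.add(match.group(1))
        elif line and not line.startswith("#"):
            # CHROM is the first field of every record line.
            used.add(line.split(None, 1)[0])
    return declared, used

# Cut-down version of the posted file (one sample column only).
example = [
    "##fileformat=VCFv4.1",
    "##contig=<ID=1,length=23011544>",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tRAL-149_Chr2L",
    "2L\t5039\t.\tC\tN,T\t.\t.\t.\tGT\t0",
]
declared, used = contig_mismatch(example)
missing = sorted(used - declared)
print("declared:", sorted(declared), "missing from header:", missing)
```

For the excerpt above this reports that 2L is used in the records but never declared.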
I didn't actually generate the file. It is from the PopFly Drosophila database (http://popfly.uab.cat/). I confess that I am primarily a theoretical biologist and do not generate my own data, so I am not particularly familiar with how VCF files are generated.
OK, we might be closer to the solution. How did you get the file? Can it be downloaded from there?
Quick workaround for the time being:
You can edit the header lines where the contigs are declared so that there is one for each chromosome (2L, 2R, 3L, 3R, X), each specifying its length. Like this:
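As a rough sketch of that edit in Python: the function below drops the existing ##contig lines and writes new ones just before the #CHROM line. The name `rewrite_contigs` is hypothetical. The only length used here is the 23011544 from the posted header, on the assumption that it belongs to 2L; lengths for 2R, 3L, 3R and X would have to come from the reference assembly the calls were made against and are deliberately not guessed.

```python
def rewrite_contigs(vcf_lines, contig_lengths):
    """Yield VCF lines with the ##contig header lines replaced.

    contig_lengths maps chromosome name -> length in bp, supplied by the caller.
    """
    for line in vcf_lines:
        line = line.rstrip("\r\n")
        if line.startswith("##contig="):
            continue  # drop the old contig declarations
        if line.startswith("#CHROM"):
            # Emit the new declarations right before the column header line.
            for name, length in contig_lengths.items():
                yield f"##contig=<ID={name},length={length}>"
        yield line

# Assumption: the length in the posted header is the length of 2L.
lengths = {"2L": 23011544}

example = [
    "##fileformat=VCFv4.1",
    "##contig=<ID=1,length=23011544>",
    '##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">',
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tRAL-149_Chr2L",
    "2L\t5039\t.\tC\tN,T\t.\t.\t.\tGT\t0",
]
fixed = list(rewrite_contigs(example, lengths))
print("\n".join(fixed))
```

The rewritten file can then be compressed with bgzip and indexed with tabix again. If the file is already compressed, bcftools reheader may be a simpler way to swap in a corrected header.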
Thank you. Yes, you can download the file from the link I posted above: under Resources you can download the data. It is the VCF file for the AM population (I cannot link to it directly for some reason).
Have you checked the line endings and file encoding? Perhaps that is what is tripping up the software.
If you want to post the VCF in question I can take a closer look at it.
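A quick way to check these points is to look at the raw bytes of the uncompressed file. The sketch below is only illustrative: `inspect_vcf_bytes` is a hypothetical helper, and the sample bytes are made up to show a file with Windows line endings and spaces instead of tabs. Whether the real file has either problem is not known from this thread, since the spaces in the posted excerpt may just be forum formatting.

```python
import codecs

def inspect_vcf_bytes(raw):
    """Report common formatting problems in the raw bytes of an uncompressed VCF."""
    report = {
        "utf8_bom": raw.startswith(codecs.BOM_UTF8),
        "crlf_line_endings": b"\r\n" in raw,
        "non_ascii_bytes": any(byte > 127 for byte in raw),
        "lines_without_tabs": 0,
    }
    for line in raw.splitlines():
        # The #CHROM line and every record must be tab-delimited;
        # ## meta lines are not expected to contain tabs.
        if line and not line.startswith(b"##") and b"\t" not in line:
            report["lines_without_tabs"] += 1
    return report

# Hypothetical sample: CRLF line endings and space-separated columns.
sample = (
    b"##fileformat=VCFv4.1\r\n"
    b"#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT S1\r\n"
    b"2L 5039 . C N,T . . . GT 0\r\n"
)
report = inspect_vcf_bytes(sample)
print(report)
```

For a real file, the bytes can be read with `open(path, "rb").read()`, where `path` points at the uncompressed VCF.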
Thanks. I have edited the question to include both the header and the first few lines.