I am trying to normalize, filter, and annotate variants in .vcf format. Right now, my workflow looks like this:
left-normalize & filter .vcf (bcftools / GATK)
convert .vcf to .tsv (GATK)
recalculate values in .tsv (e.g. HaplotypeCaller frequency, etc.)
annotate .vcf (ANNOVAR)
merge annotation & .tsv
However, I am having issues with variants that are formatted like this:
chrX 66766356 . TGGCGGCGGCGGC T
When I try to 'normalize' them using bcftools norm and GATK LeftAlignAndTrimVariants, these variants are not changed.
But, when I pass these variants through ANNOVAR, the output looks like this:
chrX 66766357 . GGCGGCGGCGGC -
This is the preferred format for annotations. But it causes problems because I am now unable to merge values from the original .vcf back into the ANNOVAR output.
As per the comment on the bcftools issue posted here, the ANNOVAR output format is "not a valid VCF record". So it seems that variant normalization tools may not be appropriate for producing this output?
Any ideas on how to fix this workflow and get both the custom selected & recalculated fields from the original .vcf combined with the ANNOVAR output in these cases?
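One possible way to make the two tables joinable is to derive an ANNOVAR-style key from each VCF record and merge on that key, rather than trying to get a normalization tool to emit the ANNOVAR format. The sketch below is only an illustration: the function name `vcf_to_annovar_key` is hypothetical, the deletion case mirrors the chrX example above, and the handling of pure insertions (start and end both set to the anchor base) is an assumption that should be checked against real ANNOVAR output before relying on it.

```python
def vcf_to_annovar_key(chrom, pos, ref, alt):
    """Return (chrom, start, end, ref, alt) in ANNOVAR-style coordinates.

    Hypothetical helper: strips the leading bases shared by REF and ALT
    (the VCF anchor base for simple indels) and shifts the position.
    """
    # Count the leading bases that REF and ALT have in common.
    shared = 0
    while shared < min(len(ref), len(alt)) and ref[shared] == alt[shared]:
        shared += 1

    new_ref = ref[shared:]
    new_alt = alt[shared:]

    if new_ref:
        # Deletion or substitution: start at the first remaining REF base.
        start = pos + shared
        end = start + len(new_ref) - 1
    else:
        # Pure insertion. ASSUMPTION: both coordinates point at the
        # anchor base; verify against actual ANNOVAR output.
        start = end = pos + shared - 1

    # An empty allele is written as "-", as in the ANNOVAR line above.
    return (chrom, start, end, new_ref or "-", new_alt or "-")


key = vcf_to_annovar_key("chrX", 66766356, "TGGCGGCGGCGGC", "T")
print(key)
```

With a key like this added as extra columns to the .tsv, the recalculated values could be joined to the annotation table on chromosome, start, end, ref, and alt.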
Yes, this is the annoying part about ANNOVAR, and it can result in inadvertent information loss if one is not aware of it. I have come up with a few personalised solutions that bypass this, but my situations were not the same as yours. For one, I never wanted to 'marry' the annotated data back to the VCF. The annotation CSV was the end of the line for me.
Why exactly do you need to 'marry' the annotated data back to the VCF? I think that the ANNOVAR function allows you to include various pieces of information from the VCF as extra columns, no?
In this case, I need to recalculate the allele frequencies from GATK HaplotypeCaller, since they are not listed as the empirical values, and I want to propagate that value through to the final annotation table.
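For illustration, one common way to get an empirical frequency is to divide the alternate-allele read count by the total read count in the per-sample AD (allelic depth) field. Whether this is the exact recalculation meant here is an assumption, and the function name `allele_fraction` is hypothetical. A minimal sketch:

```python
def allele_fraction(ad_field, alt_index=1):
    """Empirical allele fraction from a VCF AD string such as "12,4".

    ASSUMPTION: AD lists the reference depth first, then one depth per
    alternate allele; alt_index=1 selects the first alternate allele.
    """
    # Treat missing values (".") as zero reads.
    depths = [0 if d == "." else int(d) for d in ad_field.split(",")]
    total = sum(depths)
    if total == 0:
        # No informative reads, so the fraction is undefined.
        return None
    return depths[alt_index] / total


print(allele_fraction("12,4"))
```

The resulting value can be written to the .tsv alongside the other recalculated fields and carried through the merge.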