Question

Problems left-normalizing variants with bcftools, GATK, combining with ANNOVAR annotations

0

6.6 years ago

steve ★ 3.5k

I am trying to normalize, filter, and annotate variants in .vcf format. Right now, my workflow looks like this:

left-normalize & filter .vcf (bcftools / GATK)
convert .vcf to .tsv (GATK)
recalculate values in .tsv (e.g. HaplotypeCaller allele frequency)
annotate .vcf (ANNOVAR)
merge annotation & .tsv

However, I am having issues with variants that are formatted like this:

chrX    66766356    .   TGGCGGCGGCGGC   T

When I try to 'normalize' them using bcftools norm and GATK LeftAlignAndTrimVariants, these variants are left unchanged.

But, when I pass these variants through ANNOVAR, the output looks like this:

chrX    66766357    .   GGCGGCGGCGGC    -

This is the preferred format for annotations. But it causes problems because I am now unable to merge values from the original .vcf back into the ANNOVAR output.
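For illustration, here is a minimal Python sketch of the coordinate shift involved, so that a merge key in ANNOVAR's style can be computed from the VCF fields. The function name `vcf_to_annovar` is hypothetical, and the logic only approximates what ANNOVAR's conversion appears to do for simple biallelic records (SNVs, deletions and insertions anchored by a leading base); it is not ANNOVAR's own code, and multi-allelic or more complex records would need extra handling.

```python
def vcf_to_annovar(chrom, pos, ref, alt):
    """Approximate ANNOVAR-style (chrom, start, end, ref, alt) for one biallelic VCF record.

    Assumption: indels are anchored by a shared leading base, as in the VCF format.
    """
    if len(ref) == 1 and len(alt) == 1:
        # SNV: coordinates are unchanged
        return (chrom, pos, pos, ref, alt)
    if len(ref) >= len(alt) and ref.startswith(alt):
        # Deletion: drop the shared anchor bases and shift the start past them
        start = pos + len(alt)
        end = pos + len(ref) - 1
        return (chrom, start, end, ref[len(alt):], "-")
    if len(alt) > len(ref) and alt.startswith(ref):
        # Insertion: start and end both point at the last anchor base
        start = pos + len(ref) - 1
        return (chrom, start, start, "-", alt[len(ref):])
    # Block substitution: keep the alleles, span the full reference allele
    return (chrom, pos, pos + len(ref) - 1, ref, alt)


# The record from the question:
print(vcf_to_annovar("chrX", 66766356, "TGGCGGCGGCGGC", "T"))
# -> ('chrX', 66766357, 66766368, 'GGCGGCGGCGGC', '-')
```

A key computed this way could be added as extra columns to the .tsv before merging, though any mismatch with ANNOVAR's actual conversion would silently drop rows, so the result should be checked against real ANNOVAR output.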

As per the comment on the bcftools issue posted here, the ANNOVAR output format is "not a valid VCF record". So it seems that maybe variant normalization tools would not be appropriate for producing this output?

Any ideas on how to fix this workflow and get both the custom selected & recalculated fields from the original .vcf combined with the ANNOVAR output in these cases?

annotation variant • 3.1k views

ADD COMMENT • link 6.6 years ago by steve ★ 3.5k

1

Yes, this is the annoying part about ANNOVAR, and it can result in inadvertent information loss if one is not aware of it. I have come up with a few personalised solutions that bypass this, but my situations were not the same as yours. For one, I never wanted to 'marry' the annotated data back to the VCF. The annotation CSV was the end of the line for me.

Why exactly do you need to 'marry' the annotated data back to the VCF? I think that the ANNOVAR function allows you to include various pieces of information from the VCF as extra columns, no?

ADD REPLY • link 6.6 years ago by Kevin Blighe 88k

0

In this case, I need to recalculate the allele frequencies from GATK HaplotypeCaller, since they are not listed as the empirical values, and I want to propagate that value through to the final annotation table.

ADD REPLY • link 6.6 years ago by steve ★ 3.5k

score 0 · Answer 1 · 2018-04-10

It looks like a lot of my problems were solved by using the --vcfinput option of ANNOVAR, among other things, like this:

table_annovar.pl "${sample_vcf}" "${annovar_db_dir}" \
        --buildver "${params.ANNOVAR_BUILD_VERSION}" \
        --remove \
        --protocol "${params.ANNOVAR_PROTOCOL}" \
        --operation "${params.ANNOVAR_OPERATION}" \
        --nastring . \
        --vcfinput \
        --otherinfo \
        --onetranscript \
        --outfile "${sampleID}"

This produces an .avinput file that lists every original line from the input VCF alongside its ANNOVAR counterpart. The same command also carries the original VCF data into the annotation output file and writes an additional .vcf-formatted annotation file. So, there is a lot of extra data to play with for custom processing.
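As a rough sketch of how that .avinput pairing could be used for the merge, the Python below maps each ANNOVAR key back to the original VCF key and then attaches a recalculated value to the annotation rows. Several things here are assumptions: the function names, the `recalc_AF` column name, and especially `vcf_offset`, the index of the column where the original VCF fields start in the .avinput file. That offset depends on how the file was generated, so it should be checked against your own output.

```python
import csv


def build_key_map(avinput_lines, vcf_offset):
    """Map ANNOVAR (chr, start, end, ref, alt) to the original VCF (CHROM, POS, REF, ALT).

    vcf_offset is an assumption: the 0-based column where the original
    VCF fields (CHROM, POS, ID, REF, ALT, ...) begin in each row.
    """
    key_map = {}
    for row in csv.reader(avinput_lines, delimiter="\t"):
        annovar_key = tuple(row[:5])
        chrom, pos, _vcf_id, ref, alt = row[vcf_offset:vcf_offset + 5]
        key_map[annovar_key] = (chrom, pos, ref, alt)
    return key_map


def attach_values(annotation_rows, key_map, recalculated, column="recalc_AF"):
    """Add a recalculated value to each annotation row, or '.' when no match is found.

    annotation_rows: dicts with ANNOVAR's Chr/Start/End/Ref/Alt columns.
    recalculated: dict keyed by the original VCF (CHROM, POS, REF, ALT) as strings.
    """
    merged = []
    for row in annotation_rows:
        annovar_key = (row["Chr"], row["Start"], row["End"], row["Ref"], row["Alt"])
        vcf_key = key_map.get(annovar_key)
        merged.append({**row, column: recalculated.get(vcf_key, ".")})
    return merged
```

In practice `avinput_lines` would be an open file handle on the .avinput file and `annotation_rows` would come from `csv.DictReader` over the annotation table; keeping the keys as strings avoids type mismatches between the two files.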

For reference, the full workflow I am working on is here: https://github.com/stevekm/vcf-filter-annotate