Question

bcftools norm resulting in '*' in alternate allele

0

3.8 years ago

prasundutta87 ▴ 670

Hi,

After splitting multiallelic variants in my human multisample exonic germline VCF, the newly generated file contained many sites with '*' as the alternate allele. The command I used is:

bcftools norm --check-ref w -f GCA_000001405.15_GRCh38_no_alt_plus_hs38d1_analysis_set.fna -m -any exonic_variants.vcf.gz >bcftools_norm.vcf

I am seeing this because the site overlaps a spanning deletion (https://gatk.broadinstitute.org/hc/en-us/articles/360035531912-Spanning-or-overlapping-deletions-allele-#article-comments). The input VCF file (generated using GATK) has this:

chr1    2503910 .       A       C,*

and it got split into:

chr1    2503910 .       A       C
chr1    2503910 .       A       *
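To gauge how many sites are affected before deciding how to treat them, one option is to count the records whose ALT column lists '*'. The following is only a minimal sketch: it assumes a tab-delimited VCF stream with ALT in column 5, and the function name `count_star_sites` is hypothetical.

```shell
# Count data records whose ALT column (field 5) lists '*' among its alleles.
# Header lines (starting with '#') are skipped.
count_star_sites() {
    awk -F'\t' '
        !/^#/ {
            n = split($5, alts, ",")
            for (i = 1; i <= n; i++) {
                if (alts[i] == "*") { c++; break }
            }
        }
        END { print c + 0 }
    '
}

# Tiny demo using the record from the question (first five columns only).
printf 'chr1\t2503910\t.\tA\tC,*\n' | count_star_sites
```

On the real input this could be fed from a decompressed stream, for example `bcftools view exonic_variants.vcf.gz | count_star_sites`.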

My question is how do I treat this scenario? Should I just remove sites with a '*' in the alternate allele? What is the best practice here?

My usual approach is to concentrate only on high-quality biallelic variants (SNVs) without normalising, since multiallelic sites are generally considered to be sequencing errors (unless I want to study genetic mosaicism). Since that's not my aim in the current study, is it advisable to skip normalisation and move directly to variant filtration? The current study also includes indels, so I could consider only biallelic indels (-v indels -m2 -M2), which removes these sites with '*'.

PS I am using the latest version of bcftools (v1.11)

SNP bcftools VCF exome • 1.8k views

updated 3 months ago by jon.klonowski ▴ 210 • written 3.8 years ago by prasundutta87 ▴ 670

score 3 · Answer 1 · 2021-02-11

3

3.8 years ago

Pierre Lindenbaum 164k

The '*' allele is meaningless in a record where it is the only ALT allele, so you can remove those records. No alternate allele or variant is lost, because '*' only marks the gap left by an upstream indel.
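As a rough illustration of that removal, the split output can be filtered with plain awk, since the command in the question writes an uncompressed VCF. This is a sketch under assumptions, not a definitive recipe: it assumes tab-delimited records with ALT in column 5, and the function name `drop_star_alt` is hypothetical.

```shell
# Pass header lines through unchanged and keep only data records
# whose ALT column (field 5) is not exactly '*'.
drop_star_alt() {
    awk -F'\t' '/^#/ || $5 != "*"'
}

# Tiny demo with the two split records from the question (first five columns only).
printf 'chr1\t2503910\t.\tA\tC\nchr1\t2503910\t.\tA\t*\n' | drop_star_alt
```

Applied to the file from the question, this might look like `drop_star_alt < bcftools_norm.vcf > bcftools_norm.no_star.vcf`, where the output filename is just a placeholder.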

3.8 years ago by Pierre Lindenbaum 164k

0

Thanks for this, Pierre!

3.8 years ago by prasundutta87 ▴ 670

0

I have a case in a phased VCF where the * allele is reported on the other allele and no upstream indel is reported:

#CHROM  POS        REF  ALT  GT
chr1    154590148  CG   C    0|1
chr1    154590149  G    *    1|0
chr1    154590149  G    C    0|1

see my question: Removing / Excluding / Collapsing Overlapping Indels

3 months ago by jon.klonowski ▴ 210