Is there a tool that can merge 2 VCF files while taking "representational ambiguity" of multi-allelic variants into account?
For example, by:
- replaying all variant alleles from the 2 VCF files into the reference genome
- identifying which alleles are actually the same but just written down in a different way
- calculating what the best way is to represent the merged variants/alleles in a new (multi-allelic) variant
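
To make the first two steps concrete, here is a minimal Python sketch of what "replaying" an allele into the reference could look like. It is only an illustration: the helper names `apply_allele` and `same_allele`, the toy reference string, and the (pos, ref, alt) tuples are all made up for this example, and it handles one allele at a time rather than whole VCF records.

```python
def apply_allele(reference: str, pos: int, ref: str, alt: str) -> str:
    """Return the sequence obtained by replacing ref with alt at 1-based pos."""
    start = pos - 1
    if reference[start:start + len(ref)] != ref:
        raise ValueError(f"REF {ref!r} does not match the reference at position {pos}")
    return reference[:start] + alt + reference[start + len(ref):]


def same_allele(reference: str, a: tuple, b: tuple) -> bool:
    """Two alleles are equivalent if replaying them yields the same sequence."""
    return apply_allele(reference, *a) == apply_allele(reference, *b)


# Hypothetical toy reference containing a CA repeat.
reference = "GCACACAT"

# The same deletion of one CA unit, written in two different ways.
a = (1, "GCA", "G")
b = (5, "ACA", "A")
# A genuinely different allele (a SNV).
c = (3, "A", "T")

print(same_allele(reference, a, b))
print(same_allele(reference, a, c))
```

Comparing full replayed sequences like this only works on a small window; on a real genome you would restrict the comparison to the region around the overlapping variants.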
See also this question and answer: "Should you decompose and normalize multi-allelic variants for comparison / ID assignment?"
The (multi-allelic) variants and their alleles differ between the two VCF files because:
- different technologies were used to produce the VCF files
- different alternative alleles are present in the samples
As far as I know, BCFtools merge does not take "representational ambiguity" of variants into account.
Wouldn't first decomposing and normalizing all variants to bi-allelic in both input VCF files, then merging and collapsing overlapping variants back to multi-allelic, destroy some information?
Can you give me an example? The three steps you list at the top describe the left-aligned, most parsimonious representation produced by vt normalize / bcftools norm (I prefer the former).
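
As a rough illustration of what left-aligned and parsimonious means, here is a small Python sketch of the usual trim-and-shift normalization for a single bi-allelic variant. This is not the code of vt or bcftools; the function name `normalize` and the toy reference are assumptions made for this example, and multi-allelic records are not handled.

```python
def normalize(reference: str, pos: int, ref: str, alt: str) -> tuple:
    """Left-align and trim one bi-allelic variant (pos is 1-based)."""
    if ref == alt:
        raise ValueError("REF and ALT are identical; nothing to normalize")
    changed = True
    while changed:
        changed = False
        # Drop a shared trailing base.
        if ref and alt and ref[-1] == alt[-1]:
            ref, alt = ref[:-1], alt[:-1]
            changed = True
        # If an allele became empty, extend both one base to the left.
        if not ref or not alt:
            if pos == 1:
                raise ValueError("cannot extend left of the first reference base")
            pos -= 1
            base = reference[pos - 1]
            ref, alt = base + ref, base + alt
            changed = True
    # Drop shared leading bases while both alleles keep at least one base.
    while len(ref) > 1 and len(alt) > 1 and ref[0] == alt[0]:
        ref, alt = ref[1:], alt[1:]
        pos += 1
    return pos, ref, alt


# Hypothetical toy reference containing a CA repeat.
reference = "GCACACAT"

# Three ways of writing the same one-unit CA deletion.
for variant in [(1, "GCA", "G"), (5, "ACA", "A"), (2, "CACA", "CA")]:
    print(variant, "->", normalize(reference, *variant))
```

All three inputs should come out as the same record, which is why normalizing both files before merging lets identical alleles line up by position, REF and ALT.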