I have trio VCFs (two parents and a patient) that report overlapping variants with `*` in place of ALT. I want to take the phased genotypes of just the patient from each trio VCF and use them to create an alternative reference transcriptome, but the downstream software does not accept the `*` character in its input. In my basic understanding, I should be able to delete every variant that has `*` as the ALT: I am led to believe these only appear because one subject in the VCF has a SNP within the range of an INDEL, so the VCF reports both the INDEL genotype and, at the location of the aforementioned SNP, the genotype of the individual with the INDEL. If this is true, I should be able to delete the variants with `*` in place of ALT, because the information in those records is already contained in the primary INDEL. HOWEVER, after I extracted a region containing a `*` variant for a single subject from the trio VCF, the output contains the `*` variant along with other variants (see example below), but the phasing information seems to suggest that there is no other variant on the allele (haplotype) to which the `*` variant is attributed. This confuses me and makes me hesitant to just delete the variant.
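For concreteness, the deletion step described above could look something like the following minimal sketch. It assumes an uncompressed, tab-separated VCF in which multiallelic sites have already been split, so that ALT is exactly `*` on the records in question. The function name `drop_star_records` is hypothetical, and this only illustrates the filtering itself, not whether filtering is safe.

```python
from typing import Iterable, Iterator


def drop_star_records(vcf_lines: Iterable[str]) -> Iterator[str]:
    """Yield header lines unchanged and data lines whose ALT is not exactly '*'.

    Assumption: tab-separated VCF text, one ALT allele per record.
    Records with a multiallelic ALT such as 'C,*' are left untouched.
    """
    for line in vcf_lines:
        if line.startswith("#"):
            yield line  # meta-information and column header lines
            continue
        fields = line.rstrip("\n").split("\t")
        # Column 5 (index 4) is ALT in the fixed VCF column order.
        if len(fields) > 4 and fields[4] == "*":
            continue  # skip the spanning-deletion placeholder record
        yield line


# Illustrative input built from the positions in this post (ID column set to '.').
example_lines = [
    "#CHROM\tPOS\tID\tREF\tALT\n",
    "chr1\t154590148\t.\tCG\tC\n",
    "chr1\t154590149\t.\tG\t*\n",
    "chr1\t154590149\t.\tG\tC\n",
]
kept_lines = list(drop_star_records(example_lines))
```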
Here is the example. In the trio VCF (two parents and the patient) I find:
#CHROM POS REF ALT
chr1 154590147 CCG C
chr1 154590148 CG C
chr1 154590149 G *
chr1 154590149 G C
and then, when I extract just the patient genotypes using `bcftools query`:
#CHROM POS REF ALT GT
chr1 154590148 CG C 0|1
chr1 154590149 G * 1|0
chr1 154590149 G C 0|1
What is going on here? I am thinking there should be a step in the genotyping, or a normalization, where I can create a VCF without these records. The reason is that if subject 1 has a SNP and subject 2 has an INDEL spanning subject 1's SNP, I would want just one SNP and one INDEL reported, without a genotype being reported at the SNP location for the subject with the INDEL.
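To make the phasing concern checkable, here is a rough sketch that looks, for each `*` record in one sample, for a deletion on the same haplotype that spans it. It assumes split (one ALT per record) sites and fully phased genotypes written as `a|b`. The names `Call`, `carried_haplotypes`, and `unexplained_star_calls` are hypothetical, and only deletions within the supplied records are considered.

```python
from typing import List, NamedTuple, Set, Tuple


class Call(NamedTuple):
    chrom: str
    pos: int
    ref: str
    alt: str
    gt: str  # phased genotype for one sample, e.g. "0|1"


def carried_haplotypes(gt: str) -> Set[int]:
    """Indices of the haplotypes that carry this record's ALT allele."""
    return {i for i, allele in enumerate(gt.split("|")) if allele == "1"}


def is_deletion(call: Call) -> bool:
    """True when REF is longer than a non-'*' ALT."""
    return call.alt != "*" and len(call.ref) > len(call.alt)


def unexplained_star_calls(calls: List[Call]) -> List[Tuple[Call, int]]:
    """Return (record, haplotype) pairs where '*' has no spanning deletion
    on the same haplotype among the supplied records."""
    unexplained = []
    for star in calls:
        if star.alt != "*":
            continue
        for hap in sorted(carried_haplotypes(star.gt)):
            covered = any(
                is_deletion(d)
                and d.chrom == star.chrom
                and hap in carried_haplotypes(d.gt)
                # The REF of the deletion must extend over the '*' position.
                and d.pos < star.pos <= d.pos + len(d.ref) - 1
                for d in calls
            )
            if not covered:
                unexplained.append((star, hap))
    return unexplained


# The three patient records shown above.
patient_calls = [
    Call("chr1", 154590148, "CG", "C", "0|1"),
    Call("chr1", 154590149, "G", "*", "1|0"),
    Call("chr1", 154590149, "G", "C", "0|1"),
]
print(unexplained_star_calls(patient_calls))
```

For the three patient records above, this flags the `*` record on the first haplotype, because the only deletion present (CG>C at 154590148) is phased onto the second haplotype. That is the same inconsistency described in this post, just stated mechanically.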
Here is another example of one of these overlapping SNPs being called on the other allele of a phased VCF, without an upstream deletion on that same allele, this time with more details pulled directly from the trio VCF: