Removing / Excluding / Collapsing Overlapping Indels
3 months ago
jon.klonowski ▴ 210

I have trio VCFs (two parents and a patient) that report overlapping variants with * in place of ALT. I want to take the phased genotype data for just the patient from each trio VCF and use it to create an alternative reference transcriptome, but the downstream software does not accept the * character in its input.

In my basic understanding, I should be able to delete all variants that have * as the ALT. I am led to believe that these only appear because one subject in the VCF has a SNP within the range of an INDEL carried by another subject, so the VCF reports both the INDEL genotype and, at the position of the aforementioned SNP, the genotype of the subject with the INDEL. If this is true, I should be able to delete the * variants, because their information is already contained in the primary INDEL record.

HOWEVER, after I extracted a region containing a * variant for a single subject from the trio VCF, the output contains the * variant and other variants (see example below), but the phasing seems to suggest that there is no other variant on the allele to which the * variant is attributed. This confuses me and makes me hesitant to just delete the variant.

Here is the example. In the trio VCF (two parents and the patient) I find:

#CHROM  POS      REF     ALT
chr1    154590147  CCG     C
chr1    154590148  CG      C
chr1    154590149  G       *
chr1    154590149  G       C

and then, when I just extract the patient genotypes using bcftools query:

#CHROM  POS        REF  ALT  GT
chr1    154590148  CG   C    0|1
chr1    154590149  G    *    1|0
chr1    154590149  G    C    0|1

What is going on here? I am thinking there should be a step in the genotyping or a normalization or something where I can create a VCF without these! The reason is that if subject 1 has a SNP, and subject 2 has an indel spanning subject 1's SNP, I would want just 1 SNP and 1 INDEL reported, without reporting the genotype of the SNP location in the patient with the INDEL.
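If dropping the * records does turn out to be safe for the downstream tool, a minimal filter could look like the sketch below. It is plain Python over text lines, not a bcftools feature. The function name `drop_star_only_records` and the `alt_col` parameter are made up for illustration. It only skips records whose ALT is exactly `*`, so multi-allelic records such as `C,*` are left untouched and would need separate handling.

```python
def drop_star_only_records(lines, alt_col=4):
    """Yield lines, skipping data records whose ALT column is exactly '*'.

    alt_col is the 0-based index of the ALT column: 4 in a standard VCF
    (CHROM POS ID REF ALT ...), 3 in the CHROM POS REF ALT GT table above.
    """
    for line in lines:
        # Header lines and blank lines pass through unchanged.
        if line.startswith("#") or not line.strip():
            yield line
            continue
        # Split on any whitespace so pasted, space-aligned rows also work.
        fields = line.split(None, alt_col + 1)
        if len(fields) > alt_col and fields[alt_col] == "*":
            continue  # spanning-deletion placeholder record: skip it
        yield line


# The patient rows from the example above (CHROM POS REF ALT GT).
patient_rows = [
    "#CHROM  POS        REF  ALT  GT",
    "chr1    154590148  CG   C    0|1",
    "chr1    154590149  G    *    1|0",
    "chr1    154590149  G    C    0|1",
]

kept = list(drop_star_only_records(patient_rows, alt_col=3))
for row in kept:
    print(row)
```

On the patient rows this keeps the header, the CG>C deletion and the G>C SNP. Whether that loses information depends on the open question above, namely which haplotype the * genotype really belongs to.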

bcftools genome vcf genotyping gatk • 384 views

Here is another example of one of these overlapping SNPs being called on the other allele of a phased VCF, with no upstream deletion on that same allele. This time I include more details pulled directly from the trio VCF:

chr2    213147679   .   CAA C   PASS    **0|0**:89,7:96:52:21:21:0|1:213147679_CAA_C:0,21,2979:0,52,3045:213147679  **0|1**:73,9:82:62:21:21:0|1:213147679_CAA_C:93,0,2513:62,0,2569:213147679  **0|1**:76,13:90:99:21:21:0|1:213147679_CAA_C:229,0,2565:198,0,2621:213147679

chr2    213147681   rs6738070   A   C   PASS    **1|1**:0,89:96:52:1|1:213147679_CAA_C:3857,364,0:3881,373,0:213147679  **0|1**:1,72:82:91:1|0:213147679_CAA_C:3874,422,122:3867,400,91:213147679   **0|1**:0,76:90:99:1|0:213147679_CAA_C:3746,559,229:3739,537,198:213147679

chr2    213147681   rs6738070   A   *   PASS    **0|0**:0,7:96:52:1|1:213147679_CAA_C:3857,2862,2979:3881,2905,3045:213147679   **1|0**:1,9:82:91:1|0:213147679_CAA_C:3874,2444,2486:3867,2456,2521:213147679   **1|0**:0,13:90:99:1|0:213147679_CAA_C:3746,2516,2565:3739,2528,2600:213147679
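The check described here can be written down explicitly. The sketch below takes the GT values from the three records above and, for every haplotype that carries the * allele, asks whether a deletion spanning that position is called on the same haplotype. It assumes bi-allelic records (allele 1 is the ALT), `|`-phased genotypes, and that the GT phase is comparable between the two records. The `Record` type and the function names are hypothetical, and the samples are referred to only by column order.

```python
from typing import List, NamedTuple, Optional, Tuple


class Record(NamedTuple):
    pos: int
    ref: str
    alt: str
    gts: List[str]  # one GT string per sample, in column order


def deleted_span(rec: Record) -> Optional[Tuple[int, int]]:
    """Return (first, last) deleted positions of a simple deletion, else None."""
    if len(rec.ref) > len(rec.alt) and rec.ref.startswith(rec.alt):
        return rec.pos + len(rec.alt), rec.pos + len(rec.ref) - 1
    return None


def unexplained_star_calls(star: Record, candidates: List[Record]):
    """List (sample_index, haplotype_index) pairs where '*' is called on a
    haplotype that carries none of the candidate spanning deletions."""
    unexplained = []
    for s, gt in enumerate(star.gts):
        if "|" not in gt:
            continue  # unphased or missing genotype: cannot be checked
        for hap, allele in enumerate(gt.split("|")):
            if allele != "1":
                continue  # this haplotype does not carry the '*' allele
            explained = False
            for d in candidates:
                span = deleted_span(d)
                d_haps = d.gts[s].split("|") if "|" in d.gts[s] else None
                if (span and d_haps and span[0] <= star.pos <= span[1]
                        and d_haps[hap] == "1"):
                    explained = True
            if not explained:
                unexplained.append((s, hap))
    return unexplained


# GT values copied from the three records above, in sample column order.
deletion = Record(213147679, "CAA", "C", ["0|0", "0|1", "0|1"])
star = Record(213147681, "A", "*", ["0|0", "1|0", "1|0"])

result = unexplained_star_calls(star, [deletion])
print(result)
```

For these records the second and third samples each carry * on the first haplotype while the CAA>C deletion is called on the second, which is exactly the mismatch described. The sketch only detects the pattern; it does not say whether the phase of the * record or of the deletion is the one to trust.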