Question

How To Represent Two Different Indels At The Same Position In A Multisample Vcf?

0

11.5 years ago

Luca Beltrame ▴ 250

While working to get this issue fixed in VarScan, I'm attempting to generate (or rather correct from the original output) a VCF record for two samples, each with a different indel at the same position.

To make it simple, the situation is:

First reference base: C
Indel in sample 1: CAA -> C (loss of 2 bases)
Indel in sample 2: CA -> C (loss of 1 base)

I know from the data that this is likely an artifact (low-coverage region), but I still need to generate a proper record for it or my analysis pipeline will not work (GATK will complain about an invalid record; see the last post in the link for more details).

How would I go about representing this in a VCF? In particular, how should I represent the REF and ALT fields? Should I split this into two records, or keep everything in one?

Thanks!

vcf variant-calling sequencing • 4.6k views

ADD COMMENT • link updated 10.9 years ago by Biostar 20 • written 11.5 years ago by Luca Beltrame ▴ 250

1

For now I'm assuming that the reference sequence is the longest one (CAA), that sample 1 has C as its ALT allele, and that sample 2 has CA as its ALT allele (so ALT is C,CA). Am I going in the right direction?

ADD REPLY • link 11.5 years ago by Luca Beltrame ▴ 250

0

That's how I would read the VCF spec as well (namely, REF=CAA and ALT=CA,C).

ADD REPLY • link 11.5 years ago by Devon Ryan 105k
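
For illustration, the REF=CAA / ALT=C,CA representation discussed above can be derived mechanically: take the longest REF as the shared reference allele and append to each ALT the reference bases its own REF did not cover. A minimal Python sketch, assuming both records are anchored at the same POS; the function name `merge_alleles` is hypothetical and not part of VarScan, GATK, or vcflib:

```python
def merge_alleles(records):
    """Combine (ref, alt) pairs that share one POS into a single REF and ALT list.

    Hypothetical helper for illustration only.
    """
    # The longest REF covers every deleted base, so use it as the shared REF.
    merged_ref = max((ref for ref, _ in records), key=len)
    merged_alts = []
    for ref, alt in records:
        if not merged_ref.startswith(ref):
            raise ValueError(f"REF {ref!r} is not a prefix of {merged_ref!r}")
        # Append the reference bases that this record's own REF did not span.
        padded = alt + merged_ref[len(ref):]
        if padded not in merged_alts:
            merged_alts.append(padded)
    return merged_ref, merged_alts


# Sample 1: CAA -> C, sample 2: CA -> C (the case from the question).
ref, alts = merge_alleles([("CAA", "C"), ("CA", "C")])
print(ref, ",".join(alts))  # CAA C,CA
```

Sample 2's one-base deletion (CA -> C) becomes CAA -> CA once it is written against the longer reference allele, which is why the two ALT alleles differ even though both were originally reported as C.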

0

What about just using commas to separate all the possible variants in the ALT column?

ADD REPLY • link 11.5 years ago by Giovanni M Dall'Olio 28k

0

But in one case the reference would be CAA, and in the other CA. In both cases the deletion is represented as C, but it is the affected reference sequence that changes.

ADD REPLY • link 11.5 years ago by Luca Beltrame ▴ 250

0

In principle there should be only one reference allele. What is the sequence of the reference genome at NCBI for that position?

ADD REPLY • link 11.5 years ago by Giovanni M Dall'Olio 28k

0

The problem is how to make it "proper" inside the VCF. The first base in the reference is C, followed by a stretch of As. So (see my comment below) it is in fact the REF field that should be written differently.

ADD REPLY • link 11.5 years ago by Luca Beltrame ▴ 250

Ram · Answer 1 · 2013-10-09

3

11.5 years ago

Erik Garrison ★ 2.4k

You can combine these variants using the vcfcreatemulti tool in vcflib:

vcfcreatemulti < broken.vcf > ok.vcf

However, this won't really handle the sample genotypes. These need to be recreated relative to each other. Ideally, this reconstruction should respect the underlying sequence reads.

Another approach would be to use a variant detection method that calls the samples and overlapping alleles jointly. I don't know the details of your pipeline, so perhaps this is inapplicable.

ADD COMMENT • link updated 5.3 years ago by Ram 45k • written 11.5 years ago by Erik Garrison ★ 2.4k
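
The genotype caveat in the answer above can be illustrated with a small sketch: after the alleles are merged, each sample's GT indices have to be renumbered against the combined ALT list. The following Python snippet only relabels indices and does not re-examine the reads, so it is not a substitute for joint calling. The function name `remap_gt` and the heterozygous genotypes are assumptions made for illustration; they are not from the original data.

```python
import re


def remap_gt(gt, index_map):
    """Rewrite a GT string such as '0/1' using index_map (old index -> new index).

    Hypothetical helper for illustration only.
    """
    # Split on '/' or '|' but keep the separators so phasing is preserved.
    parts = re.split(r"([/|])", gt)
    return "".join(
        p if p in ("/", "|", ".") else str(index_map[int(p)])
        for p in parts
    )


# Merged record: REF=CAA, ALT=C,CA (allele 1 = C, allele 2 = CA).
# Sample 1 carried CAA -> C, which stays allele 1.
print(remap_gt("0/1", {0: 0, 1: 1}))  # 0/1
# Sample 2 carried CA -> C, which is allele 2 (CA) in the merged record.
print(remap_gt("0/1", {0: 0, 1: 2}))  # 0/2
```

Missing calls (`./.`) pass through unchanged. Whether a sample is truly reference at the other allele still has to be decided from the reads.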

0

I'll look into it, thanks. The issue here is a bug in VarScan, which has been reported. However, I'll see whether vcfcreatemulti can work as a stopgap solution.

ADD REPLY • link 11.5 years ago by Luca Beltrame ▴ 250