Structural variants are annotated differently from SNVs and small indels in VCF. Please refer to the VCF format v4.2 specification for the details.
For your example:
1 25688926 DUP_gs_CNV_1_25688926_25700415 G <CN0>,<CN2> . PASS SVTYPE=CNV;END=25700415;CS=DUP_gs;AC=101,12;AF=0.02016773,0.00239617;NS=2504;AN=5008;EAS_AF=0.0169,0.004;EUR_AF=0.007,0.001;AFR_AF=0.0492,0.0015;AMR_AF=0.013,0.0014;SAS_AF=0.0031,0.0041 GT
- Chromosome: 1
- Start: 25688926
- End: 25700415
- Type: multiallelic CNV
- Reference Allele: Copy Number 1 (per haplotype)
- Alternate Alleles: Copy Number 0 (<CN0>), Copy Number 2 (<CN2>)
For the genotypes, the total copy number is the sum of the copy numbers of the two haplotypes:
0|0: copy number 2 (1 + 1)
0|1: copy number 1 (1 + 0)
0|2: copy number 3 (1 + 2)
1|2: copy number 2 (0 + 2)
2|2: copy number 4 (2 + 2)
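The genotype arithmetic above can be sketched in Python. This is only an illustration, not part of any VCF library: the function name total_copy_number is made up here, and it assumes that every ALT allele has the form <CNx> and that the reference haplotype carries one copy.

```python
import re

def total_copy_number(gt, alts, ref_cn=1):
    """Sum per-haplotype copy numbers for a genotype such as '0|2'.

    Hypothetical helper: assumes every ALT allele looks like <CNx> and
    that the reference haplotype carries ref_cn copies (1 by default).
    Missing alleles ('.') are not handled and raise ValueError.
    """
    # Allele index 0 is the reference; index i > 0 refers to alts[i - 1].
    per_allele = [ref_cn]
    for alt in alts:
        m = re.fullmatch(r"<CN(\d+)>", alt)
        if m is None:
            raise ValueError("not a <CNx> allele: " + alt)
        per_allele.append(int(m.group(1)))
    # Accept both phased (|) and unphased (/) genotypes.
    indices = re.split(r"[|/]", gt)
    return sum(per_allele[int(i)] for i in indices)

alts = ["<CN0>", "<CN2>"]
for gt in ["0|0", "0|1", "0|2", "1|2", "2|2"]:
    print(gt, total_copy_number(gt, alts))
```

For the record above this prints 2, 1, 3, 2 and 4 for the five genotypes, matching the list.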
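If you want to parse a VCF in Perl:
#!/usr/bin/perl
use strict;
use warnings;

# Print chromosome, POS and END for every record of the VCF given as argument.
open(my $in, '<', $ARGV[0]) or die "Cannot open $ARGV[0]: $!";
while (my $line = <$in>) {
    next if $line =~ /^#/;            # skip header lines
    chomp $line;
    my @r    = split /\t/, $line;
    my @info = split /;/, $r[7];      # INFO is the 8th column
    my $e    = '';
    foreach my $kv (@info) {
        # anchor the match so that CIEND= is not mistaken for END=
        if ($kv =~ /^END=(.*)/) { $e = $1; }
    }
    print join("\t", $r[0], $r[1], $e), "\n";
}
close $in;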
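In Python:
#!/usr/bin/env python3
import sys

# Print chromosome, POS and END for every record of the VCF given as argument.
with open(sys.argv[1], 'r') as f:
    for l in f:
        if l.startswith('#'):  # skip header lines
            continue
        r = l.rstrip('\n').split('\t')
        e = ''
        for x in r[7].split(';'):  # INFO is the 8th column
            # startswith avoids matching CIEND=
            if x.startswith('END='):
                e = x[len('END='):]
        print(r[0], r[1], e)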
very helpful :) thanks!
Thanks for your reply. My question was about the exact start position, but I have found the answer in the VCF format 4.2 guidelines that you linked to:
If any of the ALT alleles is a symbolic allele (an angle-bracketed ID String) then the padding base is required and POS denotes the coordinate of the base preceding the polymorphism.
That means that the deleted/duplicated region in this case starts at 25,688,927 and ends at 25,700,415 (size = 11,489 bp).
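The same calculation as a small Python sketch, for anyone who wants to script it. The function name sv_interval is made up for this example, and it assumes the record has a symbolic ALT allele (so POS is the padding base) and an END key in INFO.

```python
def sv_interval(pos, info):
    """Return (start, end, size) of the variant region.

    Hypothetical helper: assumes a symbolic ALT allele, so POS is the
    padding base and the affected region starts at POS + 1, and assumes
    INFO contains an END= entry.
    """
    # Build a key -> value map from INFO; flag entries without '=' are skipped.
    fields = dict(kv.split("=", 1) for kv in info.split(";") if "=" in kv)
    end = int(fields["END"])
    start = pos + 1
    # Both coordinates are 1-based and inclusive.
    return start, end, end - start + 1

info = "SVTYPE=CNV;END=25700415;CS=DUP_gs"
print(sv_interval(25688926, info))  # (25688927, 25700415, 11489)
```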