Question

Start position of structural variants in 1000 Genomes

3

8.6 years ago

M. Möller ▴ 40

I'm analyzing structural variants in the 1000 Genomes VCF files and I have a question about the start position. Let's say I have the following CNV:

Pos: 25688926  
Chromosome: 1  
ID: esv3585526   
Ref: G  
Alt: <CN2>      
Info: ...END=25700415;SVTYPE=CNV;VT=SV..

What is the start point of this duplication? Does it start at 25688926 and include G (Ref), or does it start after G, at 25688927?

Thanks.

1000 Genomes SV • 2.8k views

score 6 · Accepted Answer · 2016-05-10

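Structural variants are annotated differently from SNVs and small indels in VCF; see the structural variant section of the VCF v4.2 specification. For a symbolic allele such as <CN2>, the specification defines POS as the coordinate of the base preceding the polymorphism, so the G at 25688926 is a padding base and the variant itself begins at 25688927.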

For your example:

1   25688926    DUP_gs_CNV_1_25688926_25700415  G   <CN0>,<CN2> .   PASS    SVTYPE=CNV;END=25700415;CS=DUP_gs;AC=101,12;AF=0.02016773,0.00239617;NS=2504;AN=5008;EAS_AF=0.0169,0.004;EUR_AF=0.007,0.001;AFR_AF=0.0492,0.0015;AMR_AF=0.013,0.0014;SAS_AF=0.0031,0.0041   GT

Chromosome: 1
Start (VCF POS): 25688926
End: 25700415
Type: multiallelic CNV
Reference Allele: Copy Number 1 (per haplotype)
Alternate Alleles: Copy Number 0, Copy Number 2
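
As a rough illustration of how these fields relate, the sketch below derives the first varying base, the event length, and the per-allele copy numbers from POS, ALT and END. It assumes the VCF v4.2 convention that POS is the base preceding the event when ALT is symbolic, and that the reference allele carries one copy per haplotype. The function name `describe_cnv` is hypothetical, not part of any library.

```python
import re

def describe_cnv(pos, alt, end):
    """Return (first_base, end, length, copies_per_allele) for a <CNn> record."""
    # Allele index 0 is the reference haplotype: assumed to carry one copy.
    copies = [1]
    for allele in alt.split(','):
        m = re.fullmatch(r'<CN(\d+)>', allele)
        if m is None:
            raise ValueError(f'not a <CNn> symbolic allele: {allele}')
        copies.append(int(m.group(1)))
    first_base = pos + 1   # assumption: POS is the padding base before the event
    length = end - pos     # number of bases from POS+1 through END, inclusive
    return first_base, end, length, copies

print(describe_cnv(25688926, '<CN0>,<CN2>', 25700415))
```

The list of copy numbers is indexed the same way as the allele numbers in the GT field.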

On genotypes:

0|0: copy number 2 (1 + 1)
0|1: copy number 1 (1 + 0)
0|2: copy number 3 (1 + 2)
1|2: copy number 2 (0 + 2)
2|2: copy number 4 (2 + 2)
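
The arithmetic in this list can be sketched as a small helper: look up the copy number carried by each allele in the genotype and add them. This is only a minimal sketch; `total_copy_number` and `allele_copies` are hypothetical names, and the `[1, 0, 2]` list assumes the reference allele is one copy followed by <CN0> and <CN2> as in the record above.

```python
import re

def total_copy_number(genotype, allele_copies):
    """Sum the copy numbers of the alleles in a GT string such as '0|2'."""
    # Accept both phased ('|') and unphased ('/') separators.
    alleles = re.split(r'[|/]', genotype)
    if any(a == '.' for a in alleles):
        return None  # missing call: total copy number is unknown
    return sum(allele_copies[int(a)] for a in alleles)

# Index 0 = reference (1 copy), 1 = <CN0>, 2 = <CN2>
allele_copies = [1, 0, 2]
for gt in ('0|0', '0|1', '0|2', '1|2', '2|2'):
    print(gt, total_copy_number(gt, allele_copies))
```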

To pull the chromosome, position and END out of a VCF in Perl:

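#!/usr/bin/perl
use strict;
use warnings;

# Print chromosome, POS and END for every record of the VCF given as the first argument.
open(my $in, '<', $ARGV[0]) or die "Cannot open $ARGV[0]: $!";
while (my $line = <$in>) {
    next if $line =~ /^#/;    # skip header lines
    chomp $line;
    my @r = split /\t/, $line;
    my $e = '';
    foreach my $field (split /;/, $r[7]) {
        # Anchor the match so that CIEND=... is not mistaken for END=
        $e = $1 if $field =~ /^END=(.*)$/;
    }
    print join("\t", $r[0], $r[1], $e), "\n";
}
close $in;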

In Python:

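#!/usr/bin/env python3
import sys

# Print chromosome, POS and END for every record of the VCF given as the first argument.
with open(sys.argv[1], 'r') as f:
    for l in f:
        if l.startswith('#'):  # skip header lines
            continue
        r = l.rstrip('\n').split('\t')
        e = ''
        for x in r[7].split(';'):
            # startswith avoids matching CIEND=...
            if x.startswith('END='):
                e = x[len('END='):]
        print(r[0], r[1], e)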