Start position of structural variants in 1000 Genomes
8.6 years ago
M. Möller ▴ 40

I'm analyzing structural variants in the 1000 Genomes VCF files and I have a question about the start position. Let's say I have the following CNV:

Pos: 25688926  
Chromosome: 1  
ID: esv3585526   
Ref: G  
Alt: <CN2>      
Info: ...END=25700415;SVTYPE=CNV;VT=SV..

What is the start point of this duplication? Does it start at 25688926 and include the G (Ref), or does it start after the G, at 25688927?

Thanks.

1000 Genomes SV • 2.8k views
8.6 years ago

Structural variants are annotated differently from SNPs and small indels in VCF. Please refer to the VCF format v4.2 guidelines.

For your example:

1   25688926    DUP_gs_CNV_1_25688926_25700415  G   <CN0>,<CN2> .   PASS    SVTYPE=CNV;END=25700415;CS=DUP_gs;AC=101,12;AF=0.02016773,0.00239617;NS=2504;AN=5008;EAS_AF=0.0169,0.004;EUR_AF=0.007,0.001;AFR_AF=0.0492,0.0015;AMR_AF=0.013,0.0014;SAS_AF=0.0031,0.0041   GT
  • Chromosome: 1
  • Start (POS field): 25688926
  • End: 25700415
  • Type: multiallelic CNV
  • Reference Allele: Copy Number 1
  • Alternate Alleles: Copy Number 0, Copy Number 2

On genotypes:

0|0: copy number 2 (1 + 1)
0|1: copy number 1 (1 + 0)
0|2: copy number 3 (1 + 2)
1|2: copy number 2 (0 + 2)
2|2: copy number 4 (2 + 2)
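
The genotype table above can be reproduced with a short sketch. It assumes that the reference allele carries one copy per haplotype and that every ALT allele has the form <CNn>; the function names are made up for illustration and are not part of any VCF library.

```python
import re

def allele_copy_numbers(alt_field, ref_copies=1):
    """Map allele index -> copies per haplotype, e.g. '<CN0>,<CN2>' -> [1, 0, 2].

    Assumption: the reference allele (index 0) carries one copy.
    """
    copies = [ref_copies]
    for alt in alt_field.split(","):
        match = re.fullmatch(r"<CN(\d+)>", alt)
        if match is None:
            raise ValueError(f"not a <CNn> allele: {alt!r}")
        copies.append(int(match.group(1)))
    return copies

def total_copy_number(genotype, copies):
    """Sum the per-haplotype copy numbers of a GT string such as '0|2' or '1/2'."""
    alleles = genotype.replace("/", "|").split("|")
    return sum(copies[int(a)] for a in alleles)

copies = allele_copy_numbers("<CN0>,<CN2>")
for gt in ("0|0", "0|1", "0|2", "1|2", "2|2"):
    print(gt, total_copy_number(gt, copies))
```

Missing genotypes such as "." are not handled here and would raise a ValueError.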

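To parse a VCF in Perl:

#!/usr/bin/perl
# Print chromosome, POS and INFO/END for every record of a VCF file.
use strict;
use warnings;

open(my $in, '<', $ARGV[0]) or die "Cannot open $ARGV[0]: $!";
while (my $line = <$in>) {
    next if $line =~ /^#/;    # skip header lines
    chomp $line;
    my @r = split /\t/, $line;
    my $end = '';
    foreach my $field (split /;/, $r[7]) {
        # anchor the match so CIEND=... is not mistaken for END=...
        $end = $1 if $field =~ /^END=(.*)$/;
    }
    print join("\t", $r[0], $r[1], $end), "\n";
}
close $in;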

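In Python:

#!/usr/bin/env python3
# Print chromosome, POS and INFO/END for every record of a VCF file.
import sys

with open(sys.argv[1]) as f:
    for line in f:
        if line.startswith('#'):  # skip header lines
            continue
        r = line.rstrip('\n').split('\t')
        end = ''
        for field in r[7].split(';'):
            # startswith avoids matching CIEND=... as END=...
            if field.startswith('END='):
                end = field[len('END='):]
        print(r[0], r[1], end)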

very helpful :) thanks!


Thanks for your reply. My question was about the exact start position, but I have found the answer in the VCF format 4.2 guidelines that you linked to.

If any of the ALT alleles is a symbolic allele (an angle-bracketed ID String) then the padding base is required and POS denotes the coordinate of the base preceding the polymorphism

That means that the deleted/duplicated region in this case starts at 25,688,927 and ends at 25,700,415 (size = 11,489 bp).
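
As a sanity check on that arithmetic, here is a minimal sketch that applies the padding-base rule to a record with symbolic ALT alleles. The function name sv_interval is hypothetical, and the code assumes the INFO field contains an END key; coordinates are 1-based and inclusive, as in VCF.

```python
def sv_interval(pos, alt, info):
    """Return (start, end, length) of the affected region for a symbolic SV.

    POS is the padding base just before the variant, so the region starts at
    POS + 1 and runs through INFO/END.
    """
    alts = alt.split(",")
    if not any(a.startswith("<") and a.endswith(">") for a in alts):
        raise ValueError("expected at least one symbolic ALT allele such as <CN2>")
    # Keep only key=value entries; flag-style INFO keys have no '='.
    fields = dict(kv.split("=", 1) for kv in info.split(";") if "=" in kv)
    end = int(fields["END"])
    start = pos + 1
    return start, end, end - start + 1

print(sv_interval(25688926, "<CN0>,<CN2>", "SVTYPE=CNV;END=25700415;CS=DUP_gs"))
```

For the record discussed in this thread, the call returns a start of 25688927, an end of 25700415 and a length of 11489.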
