I'm trying to interpret copy number as described in the 1000 Genomes Project's phase 3 integrated call set. Here are some relevant lines from the VCF header:
##fileformat=VCFv4.1
##contig=<ID=1,assembly=b37,length=249250621>
##ALT=<ID=CNV,Description="Copy Number Polymorphism">
##ALT=<ID=DEL,Description="Deletion">
##ALT=<ID=DUP,Description="Duplication">
##ALT=<ID=INS:ME:ALU,Description="Insertion of ALU element">
##ALT=<ID=INS:ME:LINE1,Description="Insertion of LINE1 element">
##ALT=<ID=INS:ME:SVA,Description="Insertion of SVA element">
##ALT=<ID=INS:MT,Description="Nuclear Mitochondrial Insertion">
##ALT=<ID=INV,Description="Inversion">
##ALT=<ID=CN0,Description="Copy number allele: 0 copies">
##ALT=<ID=CN1,Description="Copy number allele: 1 copy">
##ALT=<ID=CN2,Description="Copy number allele: 2 copies">
##ALT=<ID=CN3,Description="Copy number allele: 3 copies">
##ALT=<ID=CN4,Description="Copy number allele: 4 copies">
{...}
##ALT=<ID=CN124,Description="Copy number allele: 124 copies">
##INFO=<ID=CS,Number=1,Type=String,Description="Source call set.">
##INFO=<ID=END,Number=1,Type=Integer,Description="End coordinate of this variant">
##INFO=<ID=MC,Number=.,Type=String,Description="Merged calls.">
##INFO=<ID=AC,Number=A,Type=Integer,Description="Total number of alternate alleles in called genotypes">
##INFO=<ID=AF,Number=A,Type=Float,Description="Estimated allele frequency in the range (0,1)">
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of samples with data">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##INFO=<ID=SVTYPE,Number=1,Type=String,Description="Type of structural variant">
I have filtered the data to only variants with an SVTYPE INFO tag set to DUP, DEL or CNV.
- Records with INFO/SVTYPE=DEL generally have ALT=<CN0>, but occasionally show ALT=<CN0>,<CN2>. In these cases, there are no calls for <CN2>.
- Records with INFO/SVTYPE=DUP generally have ALT=<CN2>, but occasionally show ALT=<CN0>,<CN2>. In these cases, there are no calls for <CN0>.
- Records with INFO/SVTYPE=CNV show a variety of combinations.
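The "no calls" observation in the bullets above can be checked per record by lining up the ALT alleles with the comma-separated INFO/AC counts. The following is a minimal sketch, assuming AC has one entry per ALT allele (as the header's Number=A declares); the function name `uncalled_alts` is mine, not part of any library, and the INFO string is shortened from one of the DEL records in this post.

```python
def uncalled_alts(alt_field, info_field):
    """Return the ALT alleles whose INFO/AC count is zero.

    Assumes AC holds one integer per ALT allele, in ALT order.
    """
    alts = alt_field.split(",")
    # Build a key -> value map from the INFO column; flag-style keys are skipped.
    info = dict(kv.split("=", 1) for kv in info_field.split(";") if "=" in kv)
    counts = [int(x) for x in info["AC"].split(",")]
    return [alt for alt, count in zip(alts, counts) if count == 0]


# ALT and (shortened) INFO from the chr2:50182899 DEL record in this post.
print(uncalled_alts("<CN0>,<CN2>", "AC=3,0;AF=0.000599042,0;AN=5008;SVTYPE=DEL"))
# prints ['<CN2>']
```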
Here is a summary of the variants in the above file filtered down to the 3 SVTYPES, with accompanying totals:
# N SVTYPE ALT
6026 DUP <CN2>
100 DUP <CN0>,<CN2>
3 DUP <CN2>,<CN3>
33329 DEL <CN0>
7 DEL <CN0>,<CN2>
6 DEL G
3 DEL A
2 DEL C
2 DEL T
1 DEL TGGTTCATTGATATTCTGCTGTGGCAC{..Truncated..},T
2716 CNV <CN0>,<CN2>
136 CNV <CN2>,<CN3>
90 CNV <CN0>,<CN2>,<CN3>
50 CNV <CN2>
35 CNV <CN0>,<CN2>,<CN3>,<CN4>
23 CNV <CN0>,<CN2>,<CN3>,<CN4>,<CN5>
23 CNV <CN2>,<CN3>,<CN4>
12 CNV <CN0>
9 CNV <CN0>,<CN2>,<CN3>,<CN4>,<CN5>,<CN6>
8 CNV <CN2>,<CN3>,<CN4>,<CN5>
4 CNV <CN3>,<CN4>
3 CNV <CN0>,<CN2>,<CN3>,<CN4>,<CN5>,<CN6>,<CN7>
3 CNV <CN2>,<CN3>,<CN4>,<CN5>,<CN6>,<CN7>
2 CNV <CN0>,<CN1>,<CN3>,<CN4>
2 CNV <CN1>,<CN3>
2 CNV <CN1>,<CN3>,<CN4>,<CN5>
1 CNV <CN1>,<CN3>,<CN4>
1 CNV <CN1>,<CN3>,<CN4>,<CN5>,<CN6>
1 CNV <CN2>,<CN3>,<CN4>,<CN5>,<CN6>
1 CNV <CN2>,<CN3>,<CN4>,<CN5>,<CN6>,<CN7>,<CN8>
1 CNV <CN2>,<CN3>,<CN4>,<CN5>,<CN6>,<CN7>,<CN8>,<CN9>
1 CNV <CN3>
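For anyone wanting to reproduce a summary like the one above, here is one way it could be tallied. This is a sketch rather than the exact procedure used for the table: it assumes tab-separated VCF data lines with ALT in column 5 and INFO in column 8, and the function name `tally_svtype_alt` is hypothetical. The demo records are abbreviated from the examples in this post.

```python
from collections import Counter


def tally_svtype_alt(lines, keep=("DUP", "DEL", "CNV")):
    """Count (SVTYPE, ALT) combinations over VCF data lines."""
    counts = Counter()
    for line in lines:
        if line.startswith("#"):
            continue  # skip header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 8:
            continue  # not a complete VCF record
        alt, info = fields[4], fields[7]
        svtype = None
        for entry in info.split(";"):
            if entry.startswith("SVTYPE="):
                svtype = entry[len("SVTYPE="):]
                break
        if svtype in keep:
            counts[(svtype, alt)] += 1
    return counts


# Abbreviated records from this post (INFO shortened).
records = [
    "##fileformat=VCFv4.1",
    "1\t738570\tesv3584979\tG\t<CN0>\t100\tPASS\tAC=1;SVTYPE=DEL;VT=SV",
    "1\t766600\tesv3584980\tG\t<CN0>\t100\tPASS\tAC=188;SVTYPE=DEL;VT=SV",
    "1\t668630\tesv3584976\tG\t<CN2>\t100\tPASS\tAC=64;SVTYPE=DUP;VT=SV",
]
for (svtype, alt), n in tally_svtype_alt(records).most_common():
    print(n, svtype, alt)
```

On a real file, the same function can be fed an open text handle (for example from `gzip.open(path, "rt")`) instead of the in-memory list.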
So, how do I interpret <CN2> when INFO/SVTYPE is DUP or CNV? Despite the header's description of <CN2>, it seems that it should describe a biallelic duplication when INFO/SVTYPE=DUP; this idea makes sense when reading the article. Does the header description only apply when INFO/SVTYPE=CNV?
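To make one candidate reading concrete: if the header descriptions are taken literally as per-haplotype copy numbers, and the REF allele is assumed to carry one copy (my assumption, not something the header states), then a sample's total copy number is the sum over its genotype alleles. The sketch below illustrates only that reading; `diploid_copy_number` and the `ref_copies` parameter are hypothetical names, and the genotypes are made up for illustration.

```python
import re


def diploid_copy_number(gt, alt_field, ref_copies=1):
    """Sum per-allele copy numbers for a genotype such as '0|1'.

    Assumes each ALT is a symbolic <CNk> allele meaning k copies on that
    haplotype, and that REF carries `ref_copies` copies (an assumption).
    Returns None if any allele in the genotype is missing ('.').
    """
    per_allele = [ref_copies]
    for alt in alt_field.split(","):
        match = re.fullmatch(r"<CN(\d+)>", alt)
        if match is None:
            raise ValueError(f"not a <CNk> allele: {alt}")
        per_allele.append(int(match.group(1)))

    total = 0
    for index in re.split(r"[|/]", gt):
        if index == ".":
            return None
        total += per_allele[int(index)]
    return total


# Heterozygous carrier of <CN2> under this reading: 1 (REF) + 2 = 3 copies.
print(diploid_copy_number("0|1", "<CN2>"))        # prints 3
# One <CN0> haplotype and one <CN2> haplotype: 0 + 2 = 2 copies.
print(diploid_copy_number("1|2", "<CN0>,<CN2>"))  # prints 2
```

Under this reading a <CN2> allele would be a gain relative to a one-copy reference haplotype, which would fit SVTYPE=DUP; whether that is what the call set intends is exactly what I am asking.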
INFO/SVTYPE=DEL examples (INFO truncated):
1 738570 esv3584979 G <CN0> 100 PASS
AC=1;AF=0.000199681;AN=5008;CS=DEL_union;END=742020;NS=2504;SVTYPE=DEL;VT=SV
1 766600 esv3584980 G <CN0> 100 PASS
AC=188;AF=0.0375399;AN=5008;CS=DEL_union;END=769112;NS=2504;SVTYPE=DEL;VT=SV
2 50182899 esv3590712;. A <CN0>,<CN2> 100 PASS
AC=3,0;AF=0.000599042,0;AN=5008;CS=DUP_uwash;END=50192857;NS=2504;SVTYPE=DEL;VT=SV
3 138606780 esv3597927;. T <CN0>,<CN2> 100 PASS
AC=1,0;AF=0.000199681,0;AN=5008;CS=DUP_gs;END=138620917;NS=2504;SVTYPE=DEL;VT=SV
INFO/SVTYPE=DUP examples:
1 668630 esv3584976 G <CN2> 100 PASS
AC=64;AF=0.0127796;AN=5008;CS=DUP_delly;END=850204;NS=2504;SVTYPE=DUP;VT=SV
1 16013837 esv3585317 T <CN2> 100 PASS
AC=11;AF=0.00219649;AN=5008;CS=DUP_delly;END=16080976;MC=DUP_uwash_chr1_16012226_16082907;SVTYPE=DUP;VT=SV
1 16037975 .;esv3585319 G <CN0>,<CN2> 100 PASS
AC=0,11;AF=0,0.00219649;AN=5008;CS=DUP_gs;END=16071850;SVTYPE=DUP;VT=SV
1 153682976 esv3587592;. G <CN2>,<CN3> 100 PASS
AC=194,0;AF=0.038738,0;AN=5008;CS=DUP_gs;END=153696281;SVTYPE=DUP;VT=SV
INFO/SVTYPE=CNV examples:
1 1609210 esv3585011;esv3585012 G <CN0>,<CN2> 100 PASS
AC=17,26;AF=0.00339457,0.00519169;AN=5008;CS=DUP_gs;END=1615827;SVTYPE=CNV;VT=SV
1 143984622 esv3587386;esv3587387 A <CN2>,<CN3> 100 PASS
AC=4791,41;AF=0.956669,0.0081869;AN=5008;CS=DUP_gs;END=144094733;NS=2504;SVTYPE=CNV;VT=SV
1 248619876 esv3589555;esv3589556;esv3589557 A <CN0>,<CN2>,<CN3> 100 PASS
AC=19,859,2;AF=0.00379393,0.171526,0.000399361;AN=5008;CS=DUP_gs;END=248634579;SVTYPE=CNV;VT=SV
Y 28462363 CNV_Y_28462363_28740539 T <CN2> 100 PASS
AC=5;AF=0.00408831;AN=1223;END=28740539;SVTYPE=CNV;VT=SV