I downloaded the gnomAD SV sites VCF and its index:
wget -O gnomad_v2_sv.sites.vcf.gz "https://storage.googleapis.com/gnomad-public/papers/2019-sv/gnomad_v2_sv.sites.vcf.gz"
wget -O gnomad_v2_sv.sites.vcf.gz.tbi "https://storage.googleapis.com/gnomad-public/papers/2019-sv/gnomad_v2_sv.sites.vcf.gz.tbi"
First, an observation: for BND records, the gnomAD browser uses the END value as the 'END' position of the second junction,
e.g. https://gnomad.broadinstitute.org/variant/BND_1_13?dataset=gnomad_sv_r2 ; https://gnomad.broadinstitute.org/variant/BND_1_21?dataset=gnomad_sv_r2 ; https://gnomad.broadinstitute.org/variant/BND_1_31?dataset=gnomad_sv_r2
I'm looking at gnomad SV . For BND it looks like the second POS is always the first POS (CH21:POS1 -> CHR2:**POS1**). Bug? https://t.co/dCustXNSJx ; https://t.co/B10LOIMpEx ; https://t.co/0dEwvmh0zE ; ...etc ... pic.twitter.com/4rhccavJ04
— Pierre Lindenbaum (@yokofakun) June 25, 2019
(I submitted an issue: https://github.com/macarthur-lab/gnomad_browser/issues/139)
Here is my problem: there is this variant:
$ bcftools view gnomad_v2_sv.sites.vcf.gz | grep gnomAD_v2_CTX_12_13 -m1
12 60718971 gnomAD_v2_CTX_12_13 N <CTX> 999 PASS END=57020218;SVTYPE=CTX;CHR2=13;SVLEN=-1;ALGORITHMS=manta;EVIDENCE=PE;CPX_TYPE=CTX_PP/QQ;PROTEIN_CODING__NEAREST_TSS=SLC16A7;PROTEIN_CODING__INTERGENIC;AN=21476;AC=1;AF=4.7e-05;N_BI_GENOS=10738;N_HOMREF=10737;N_HET=1;N_HOMALT=0;FREQ_HOMREF=0.999907;FREQ_HET=9.31272e-05;FREQ_HOMALT=0;AFR_AN=9480;AFR_AC=0;AFR_AF=0;AFR_N_BI_GENOS=4740;AFR_N_HOMREF=4740;AFR_N_HET=0;AFR_N_HOMALT=0;AFR_FREQ_HOMREF=1;AFR_FREQ_HET=0;AFR_FREQ_HOMALT=0;AMR_AN=1784;AMR_AC=1;AMR_AF=0.000561;AMR_N_BI_GENOS=892;AMR_N_HOMREF=891;AMR_N_HET=1;AMR_N_HOMALT=0;AMR_FREQ_HOMREF=0.998879;AMR_FREQ_HET=0.00112108;AMR_FREQ_HOMALT=0;EAS_AN=2226;EAS_AC=0;EAS_AF=0;EAS_N_BI_GENOS=1113;EAS_N_HOMREF=1113;EAS_N_HET=0;EAS_N_HOMALT=0;EAS_FREQ_HOMREF=1;EAS_FREQ_HET=0;EAS_FREQ_HOMALT=0;EUR_AN=7598;EUR_AC=0;EUR_AF=0;EUR_N_BI_GENOS=3799;EUR_N_HOMREF=3799;EUR_N_HET=0;EUR_N_HOMALT=0;EUR_FREQ_HOMREF=1;EUR_FREQ_HET=0;EUR_FREQ_HOMALT=0;OTH_AN=388;OTH_AC=0;OTH_AF=0;OTH_N_BI_GENOS=194;OTH_N_HOMREF=194;OTH_N_HET=0;OTH_N_HOMALT=0;OTH_FREQ_HOMREF=1;OTH_FREQ_HET=0;OTH_FREQ_HOMALT=0;POPMAX_AF=0.000561
1) The Broad uses SVTYPE=CTX: isn't that against the VCF spec?
The value should be one of DEL, INS, DUP, INV, CNV, BND.
2) The Broad uses INFO/END as the second locus of the translocation (here chr12:60718971 -> chr13, END=57020218). Isn't that against the VCF spec?
3) And finally, why can't I find this variant with bcftools (1.9-94-g9589876) or tabix?
$ bcftools view gnomad_v2_sv.sites.vcf.gz "12:60718970-60718972" | grep gnomAD_v2_CTX_12_13
$
$ tabix gnomad_v2_sv.sites.vcf.gz "12:60718970-60718972" | grep gnomAD_v2_CTX_12_13
$
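A possible lead for question 3, offered as a hypothesis and not verified in this thread: in the record above, INFO/END (57020218) is smaller than POS (60718971), because END holds a coordinate on chr13 while POS is on chr12. If the index uses INFO/END when computing the interval covered by a record, that inconsistency could explain why a region query around POS returns nothing. The sketch below only flags the inconsistency; `end_vs_pos` is a hypothetical helper, not a bcftools or tabix feature, and the record is a trimmed copy of the one shown above.

```shell
# Hypothetical helper: classify one tab-separated VCF data line
# by comparing INFO/END (column 8) with POS (column 2).
end_vs_pos() {
  printf '%s\n' "$1" | awk -F'\t' '
    {
      e = ""
      n = split($8, kv, ";")
      for (i = 1; i <= n; i++)
        if (kv[i] ~ /^END=/) e = substr(kv[i], 5)   # text after "END="
      if (e == "")              print "no_END"
      else if (e + 0 < $2 + 0)  print "END_before_POS"
      else                      print "ok"
    }'
}

# The record from the question, cut down to the fields that matter here.
rec=$(printf '12\t60718971\tgnomAD_v2_CTX_12_13\tN\t<CTX>\t999\tPASS\tEND=57020218;SVTYPE=CTX;CHR2=13')
end_vs_pos "$rec"   # prints END_before_POS
```

The same awk body could be applied to the header-less output of `bcftools view -H gnomad_v2_sv.sites.vcf.gz` to count how many records share this property.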
Isn't SVTYPE=CTX a holdover from TIGRA-SV? Its presence may be due to the support in GRanges readVCF for non-compliant VCF SV types. That would not help with bcftools, but it might explain how these SV annotations made it into gnomAD.
https://bioconductor.org/packages/devel/bioc/vignettes/StructuralVariantAnnotation/inst/doc/vignettes.html
StructuralVariantAnnotation supports structural variants reported in the following VCF notations:
- Non-symbolic allele
- Symbolic allele with SVTYPE of DEL, INS, and DUP
- Breakpoint notation SVTYPE=BND
- Single breakend notation
In addition to parsing spec-compliant VCFs, additional logic has been added to enable parsing of non-compliant variants for the following callers:
- Pindel (SVTYPE=RPL)
- manta (INV3, INV5 fields)
- Delly (SVTYPE=TRA, CHR2, CT fields)
- TIGRA (SVTYPE=CTX)
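To see which records carry an SVTYPE outside the list quoted in the question (DEL, INS, DUP, INV, CNV, BND), a small check along these lines might help. It is a minimal sketch: `svtype_status` is a hypothetical helper name, the allowed list is copied from the question and not re-derived from the spec, and the record is a trimmed copy of the gnomAD_v2_CTX_12_13 line.

```shell
# Hypothetical helper: print the SVTYPE of one tab-separated VCF data line
# and whether it appears in the list quoted in the question.
svtype_status() {
  printf '%s\n' "$1" | awk -F'\t' '
    BEGIN {
      split("DEL INS DUP INV CNV BND", a, " ")
      for (i in a) ok[a[i]] = 1
    }
    {
      t = "."                                          # "." when SVTYPE is absent
      n = split($8, kv, ";")
      for (i = 1; i <= n; i++)
        if (kv[i] ~ /^SVTYPE=/) t = substr(kv[i], 8)   # text after "SVTYPE="
      print t "\t" ((t in ok) ? "listed" : "not_listed")
    }'
}

rec=$(printf '12\t60718971\tgnomAD_v2_CTX_12_13\tN\t<CTX>\t999\tPASS\tEND=57020218;SVTYPE=CTX;CHR2=13')
svtype_status "$rec"   # prints CTX and not_listed, tab-separated
```

For a quick tally over the whole file, something like `bcftools query -f '%INFO/SVTYPE\n' gnomad_v2_sv.sites.vcf.gz | sort | uniq -c` should list every SVTYPE value in use.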