Meaning of the FORMAT fields for the Genome In A Bottle reference VCF file
0
0
Entering edit mode
6.4 years ago

After reading the GIAB paper in https://www.biorxiv.org/content/early/2018/05/25/281006 and its Figure 1, I am still having trouble understanding the data inside the GIAB VCF file for HG001 (HG001_GRCh38_GIAB_highconf_CG-IllFB-IllGATKHC-Ion-10X-SOLID_CHROM1-X_v.3.3.2_all.vcf). In particular, I need to understand the information that led to a FILTER value of PASS for some variants.

The first case that I would like to discuss is the following variant:

chr1 110176467 . A G 50 PASS platforms=3;platformnames=Illumina,CG,10X;datasets=3;datasetnames=HiSeqPE300x,CGnormal,10XChromium;callsets=4;callsetnames=HiSeqPE300xGATK,CGnormal,HiSeqPE300xfreebayes,10XGATKhaplo;datasetsmissingcall=IonExome,SolidPE50x50bp,SolidSE75bp;callable=CS_HiSeqPE300xGATK_callable,CS_CGnormal_callable,CS_HiSeqPE300xfreebayes_callable GT:PS:DP:ADALL:AD:GQ 0/1:.:512:114,106:145,148:760

Translating the information in the line to English, this is what I get:

  • We considered data from the following sequencing technologies: Illumina,CG,10X,IonExome,SolidPE50x50bp, and SolidSE75bp.
  • Of those, some of them (IonExome,SolidPE50x50bp,SolidSE75bp) were missing the call, so we discarded them.
  • Therefore, the data for the variant is going to come from 3 platforms (Illumina,CG,10X), 3 datasets (HiSeqPE300x,CGnormal,10XChromium), and 4 callsets (HiSeqPE300xGATK,CGnormal,HiSeqPE300xfreebayes,10XGATKhaplo), because we analyzed the data from HiSeqPE300x with GATK and also with Freebayes.
  • However, the callsets from CS_HiSeqPE300xGATK,CS_CGnormal, and CS_HiSeqPE300xfreebayes had this call in a region with low coverage of high MQ reads.
  • Across all the platforms, DP=512
  • ADALL (Values 114,106) is coming from “all" the datasets. Are those “all” the set (HiSeqPE300x + Cgnormal + 10XChromium), or the set (Illumina + CG + 10X + IonExome + SolidPE50x50bp + SolidSE75bp).
  • AD (Values 145,148) is coming from the same “all” datasets as ADALL, but they are unfiltered, so 145>114 and 148>106.

All this make sense, but I would like to know what case of Figure 1c was applied to arrive at the PASS value.

The second variant that I would like to mention is this one:

chr1 5705293 . T C 50 PASS platforms=3;platformnames=Illumina,CG,10X;datasets=3;datasetnames=HiSeqPE300x,CGnormal,10XChromium;callsets=4;callsetnames=HiSeqPE300xGATK,CGnormal,HiSeqPE300xfreebayes,10XGATKhaplo;datasetsmissingcall=IonExome,SolidPE50x50bp,SolidSE75bp;callable=CS_HiSeqPE300xGATK_callable,CS_CGnormal_callable,CS_HiSeqPE300xfreebayes_callable;filt=CS_HiSeqPE300xGATK_filt GT:PS:DP:ADALL:AD:GQ 0/1:.:599:121,139:49,28:627

Again, my translation would be:

  • We considered data from the following sequencing technologies: Illumina,CG,10X,IonExome,SolidPE50x50bp, and SolidSE75bp.

  • Of those, some of them (IonExome,SolidPE50x50bp,SolidSE75bp) were missing the call, so we discarded them.

  • Therefore, the data for the variant is going to come from 3 platforms (Illumina,CG,10X), 3 datasets (HiSeqPE300x,CGnormal,10XChromium), and 4 callsets (HiSeqPE300xGATK,CGnormal,HiSeqPE300xfreebayes,10XGATKhaplo), because we analyzed the data from HiSeqPE300x with GATK and also with Freebayes.

  • However, the callsets from CS_HiSeqPE300xGATK,CS_CGnormal, and CS_HiSeqPE300xfreebayes had this call in a region with low coverage of high MQ reads.
  • Also, the data from CS_HiSeqPE300xGATK_filt was filtered/discarded
  • Across all the platforms, DP=599
  • ADALL (Values 121,139) is coming from “all" the datasets. I assume (Cgnormal + 10XChromium), because CS_HiSeqPE300xGATK_filt was filtered.
  • AD (Values 49,28) are unfiltered.

But notice that 49<121 and 28<139. How come that there are less unfiltered reads (AD)? Should not be the case that AD > ADALL for all instances?

I don’t know how to reconcile the two variants. Aren’t they providing opposite information in the AD and ADALL fields? Which specific datasets and callsets are involved in getting AD and ADALL for each variant?

I would also appreciate answers to a couple of other questions:

  • Why are these two variants a PASS if there are 3 “callable” datasets in regions of low coverage?

  • Did the variants follow an arbitration process? I guess not, if the INFO flag “arbitrated” is missing.

sequencing genome hg38 vcf • 1.7k views
ADD COMMENT

Login before adding your answer.

Traffic: 2754 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6