Question

Using GIAB exome data to test targetome variant-calling pipeline

0

Entering edit mode

9.0 years ago

Floydian_slip ▴ 170

Hi,

I am trying to test my variant calling pipeline that I have prepared for my target region of 5Mb for which I need a test dataset where the fastq files and the VCF files are provided so that I can run my pipeline on those fastqs and then compare my VCF to those VCFs. So, I used the two whole exome datasets from Genome In a Bottle data with the following steps:

In order to obtain the fastqs that originated from my target region, I used the bam files published by GIAB and extracted the alignments falling in my target region using samtools
Converted the extracted bams in those regions to paired-end fastqs
Aligned those fastqs to the entire genomeusing BWA
Called the variants in my region of interest using GATK (with -L option, restricting my analysis to my target region for RealignerTargetCreator and HaplotypeCaller).
Compared my variants to the variants from GIAB in the target region.

But I only get 50% variant calls correctly. I call only 50% of the total variants in that region. What could be the reason for this low agreement rate?

Interestingly, when I call the variants on the whole exome for which the fastqs were originally created, I can obtain 96% of variants in the whole-exome region published by GIAB. Moreover, from those variants, when I extract the variants that fall in my target regions and compare it to the corresponding GIAB variants, I can go upto 97%. In other words, when I use the entire whole-exome data, I can call 97% of the variants in my target region, but when I start with the alignments falling in my target region, convert to fastq and then call variants in the target region, I only get 50%.

Can somebody help me figure out what could be going wrong?

Thanks,

~N

whole-exome targetome variant-calling GIAB • 2.7k views

ADD COMMENT • link updated 6.4 years ago by Ram 44k • written 9.0 years ago by Floydian_slip ▴ 170