Question

GATK SNP calling - combine exome and WGS?

0

3.2 years ago

gsnps • 0

Hi there

I'm calling SNPs on the human genome using GATK.

For each sample I have both low coverage WGS data and also exome data.

I'm calling SNPs per sample using HaplotypeCaller and producing gVCFs to later combine across samples. Everything works just great and as expected.

But I'm wondering whether I should combine the exome + WGS data for each sample or call on them separately.

So should I, for example, run CombineGVCFs once on all the WGS gVCFs and once on all the exome gVCFs, to produce one merged file for exome and one for WGS?

Or should I throw everything together and produce one final merged VCF? And if I do this, how should I tell GATK which exome reads are from the same sample as which WGS reads? Read groups?

Thanks..

GATK • 1.3k views

ADD COMMENT • link updated 3.2 years ago by vdauwera ★ 1.2k • written 3.2 years ago by gsnps • 0

score 2 · Answer 1 · 2021-08-30

2

3.2 years ago

vdauwera ★ 1.2k

This is kind of a tough question. The problem with throwing everything together at the read stage is that it introduces confounding effects that will mess things up at the filtering stage, because WGS and exome data have very different error modes. And at the gVCF stage you run into the issue that the program won’t know how to combine statistics from the same sample. So I would recommend calling the WGS and exome runs separately, then defining some custom logic for how you want to combine the information from the final VCFs.

For example, you can decide to use concordance between the two data types as an indication of quality, and apply more stringent filters to anything that is not concordant (which means you also need to decide which one you trust more). Basically you need to decide what the value is of the information contributed by the WGS vs the exome. There’s unfortunately no one-size-fits-all answer to that question.
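To make the concordance idea concrete, here is a minimal Python sketch. It is not a GATK tool or an official recommendation; the function names (`parse_vcf_calls`, `classify_concordance`) and the four labels are hypothetical, and it assumes single-sample, uncompressed VCF text with one record per site and a `GT` field. It labels each non-reference call as concordant, genotype-discordant, or seen in only one data type, so that stricter filters can then be applied to the non-concordant groups.

```python
from collections import Counter


def parse_vcf_calls(lines, sample_index=0):
    """Return {(chrom, pos, ref, alt): genotype} for the non-reference calls of one sample.

    Assumes plain VCF text lines with a FORMAT column that contains GT.
    """
    calls = {}
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        chrom, pos, ref, alt = fields[0], int(fields[1]), fields[3], fields[4]
        format_keys = fields[8].split(":")
        sample_values = fields[9 + sample_index].split(":")
        genotype = sample_values[format_keys.index("GT")].replace("|", "/")
        alleles = sorted(genotype.split("/"))  # so that 1/0 and 0/1 compare equal
        if all(allele in ("0", ".") for allele in alleles):
            continue  # hom-ref or no-call: not a variant call
        calls[(chrom, pos, ref, alt)] = "/".join(alleles)
    return calls


def classify_concordance(wgs_calls, exome_calls):
    """Label every site called in either data type (hypothetical labels)."""
    labels = {}
    for site in wgs_calls.keys() | exome_calls.keys():
        wgs_gt = wgs_calls.get(site)
        exome_gt = exome_calls.get(site)
        if wgs_gt is None:
            labels[site] = "exome_only"
        elif exome_gt is None:
            labels[site] = "wgs_only"
        elif wgs_gt == exome_gt:
            labels[site] = "concordant"
        else:
            labels[site] = "discordant_genotype"
    return labels


def _record(chrom, pos, ref, alt, genotype):
    """Build a minimal VCF data line for the demo below."""
    return "\t".join([chrom, str(pos), ".", ref, alt, "50", "PASS", ".", "GT", genotype])


if __name__ == "__main__":
    # Made-up records, only to show the call pattern.
    wgs = [_record("chr1", 100, "A", "G", "0/1"), _record("chr1", 300, "G", "A", "0/1")]
    exome = [_record("chr1", 100, "A", "G", "1|0"), _record("chr1", 400, "T", "C", "0/1")]
    labels = classify_concordance(parse_vcf_calls(wgs), parse_vcf_calls(exome))
    print(Counter(labels.values()))
```

The comparison is only meaningful inside the exome capture targets: everywhere else, `wgs_only` is simply the expected outcome rather than a sign of a bad call. The sketch also matches sites by exact position and alleles, so indels or multi-allelic records written differently in the two files would need to be normalised first.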

ADD COMMENT • link 3.2 years ago by vdauwera ★ 1.2k

0

That is a really great answer, thank you!!

"because WGS and exome data have very different error modes." - I had a feeling this would be the problem.

"apply more stringent filters to anything that is not concordant" - That makes a lot of sense. Good idea.

Ok thanks. I am merging the exome data and the WGS data across samples separately.

By the way, do you have any feel for the number of samples needed to apply Variant Quality Score Recalibration to the WGS? I found a couple of sources in the GATK documentation saying at least 30 samples for the exome (and to pad out with 1000 Genomes data if you have fewer). What I've read seems to suggest that on WGS even a single sample is OK, but is there a practical minimum?

ADD REPLY • link 3.2 years ago by gsnps • 0

1

You're welcome!

For filtering, those are indeed the standard recommendations. VQSR can usually run successfully even on just one WGS sample, though you may get better results with the newer method, GATK CNN, which uses deep learning and does a better job on indels especially. For anything above one WGS sample you should be fine with VQSR. The reason exomes have those 30+ recommendations is that there are so many fewer variants in a single sample's exome: you need a lot of samples to accumulate enough variants to build a stable model using VQSR's internal algorithm.

Good luck!

ADD REPLY • link 3.2 years ago by vdauwera ★ 1.2k

0

Don't add answers when you mean to reply to someone's post. Only add answers if you're answering the top level post.

ADD REPLY • link 3.2 years ago by Ram 44k