Question

Why do people not call normal and tumor variant separately for somatic mutation identification?

5

8.8 years ago

DVA ▴ 630

Hello,

I have been using VarScan and MuTect to call somatic variants. Lately, out of curiosity, I also tried calling variants separately with GATK for the tumor sample (vs. hg19) and the normal sample (also vs. hg19), and then compared the two outputs. I thought the variants found only in the tumor should, in theory, also be the somatic variants. However, the number of variants found this way is far larger than the number found by VarScan or MuTect.

My question is: what's wrong with doing so? Is it because some systematic errors in the algorithm somehow get doubled when calling variants separately?

Thank you very much.

snp • 4.6k views

ADD COMMENT • link updated 8.8 years ago by Chris Miller 22k • written 8.8 years ago by DVA ▴ 630

Ram · Accepted Answer · 2016-02-01

6

8.8 years ago

Chris Miller 22k

There are posts that hint at a lot of this if you search on here. The short answer is that somatic mutation calling is a different beast. To give just one example: in single-sample (germline) calling, you can assume that the variant allele frequency (VAF) of a SNP will generally be about 50% or 100%. In tumors, you have to take purity, ploidy, and tumor heterogeneity into account, and variants may occur across a wide range of VAFs.

We get into all of this a little bit in our recent paper here: Optimizing Cancer Genome Sequencing and Analysis
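To make the VAF point concrete, here is a rough sketch of how purity, copy number, and subclonality pull the expected VAF of a somatic variant away from 50%. It uses a simplified textbook model, not the method of any particular caller; the function name and parameters are made up for illustration, and it ignores sequencing error and sampling noise.

```python
def expected_vaf(purity, tumor_copies=2, mutated_copies=1,
                 cell_fraction=1.0, normal_copies=2):
    """Expected VAF of a somatic variant under a simplified mixture model.

    purity         -- fraction of cells in the sample that are tumor cells
    tumor_copies   -- total copy number at the locus in tumor cells
    mutated_copies -- copies carrying the variant in a mutated tumor cell
    cell_fraction  -- fraction of tumor cells carrying the variant
    normal_copies  -- copy number at the locus in normal cells
    (All names are hypothetical; this is an illustration, not a caller.)
    """
    # Variant-carrying alleles contributed by tumor cells.
    mutant = purity * cell_fraction * mutated_copies
    # All alleles at the locus, from tumor and contaminating normal cells.
    total = purity * tumor_copies + (1.0 - purity) * normal_copies
    return mutant / total


if __name__ == "__main__":
    # Pure, diploid, clonal heterozygous variant: the germline-like case.
    print(round(expected_vaf(1.0), 3))                      # about 0.5
    # Same variant at 40% purity.
    print(round(expected_vaf(0.4), 3))                      # about 0.2
    # 40% purity, variant in only a quarter of the tumor cells.
    print(round(expected_vaf(0.4, cell_fraction=0.25), 3))  # about 0.05
```

A caller tuned to expect roughly 50% or 100% has no good way to interpret a real variant sitting at 5%.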

ADD COMMENT • link updated 4.9 years ago by Ram 44k • written 8.8 years ago by Chris Miller 22k

0

Hi Chris, thank you very much for the quick reply. I'm going to read your paper right now. I did search briefly before this post, but couldn't obtain a satisfying answer.

So just to confirm: if I don't worry about purity, ploidy, heterogeneity or AFs, calling variants separately would be appropriate then? Thank you:)

ADD REPLY • link updated 4.9 years ago by Ram 44k • written 8.8 years ago by DVA ▴ 630

2

Sort of. There are also sequencing artifacts to consider, which are easier to detect with joint calling (because they'll appear in both samples). There's a nice overview here in the Strelka paper:

In earlier work somatic variants have been detected by independently genotyping both samples and subtracting the results, an approach which can provide reasonable predictions for cell lines because the aforementioned variability in somatic allele frequency is reduced for this case. For the general case, a joint analysis of both samples should improve results by facilitating tests for candidate somatic alleles in both samples (especially important for indels) and enabling better representation of sequencing noise and tumor impurity. Two prevalent approaches to joint sample analysis are (i) to use joint diploid genotype likelihoods for both samples and (ii) to disregard such genotype structure and test whether a shared allele frequency between the two samples can be rejected. An implementation of the first approach is available in the SomaticSniper package (Larson et al., 2012), whereas the second approach is implemented in VarScan, which applies Fisher's exact test to the sample allele frequencies (Koboldt et al., 2009). Note that these approaches to joint sample analysis stand in contrast to solutions addressing the related problem of independent tumor sample analysis, such as in SNVMix (Goya et al., 2010), although both cases share the challenge of non-diploid tumor allele frequencies.
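As a rough illustration of the second approach in the quote (testing whether a shared allele frequency between the two samples can be rejected), here is a minimal one-sided Fisher's exact test on ref/alt read counts. This is only a standard-library sketch, not VarScan's actual code; the function name and the read counts are invented for the example.

```python
from math import comb


def fisher_one_sided(normal_ref, normal_alt, tumor_ref, tumor_alt):
    """One-sided Fisher's exact test p-value on a 2x2 table of read counts.

    Null hypothesis: tumor and normal share the same alt-allele frequency.
    Returns the probability of seeing at least `tumor_alt` alt reads in the
    tumor given the table margins. (Hypothetical helper, illustration only.)
    """
    total_alt = normal_alt + tumor_alt
    tumor_depth = tumor_ref + tumor_alt
    normal_depth = normal_ref + normal_alt
    # Number of ways to distribute the alt reads over all reads.
    denom = comb(tumor_depth + normal_depth, total_alt)
    p_value = 0.0
    # Sum hypergeometric probabilities for k >= observed tumor alt reads.
    for k in range(tumor_alt, min(total_alt, tumor_depth) + 1):
        p_value += comb(tumor_depth, k) * comb(normal_depth, total_alt - k) / denom
    return p_value


if __name__ == "__main__":
    # Alt allele only in the tumor: 12/60 reads vs 0/40 in the normal.
    print(fisher_one_sided(40, 0, 48, 12))
    # Low-level artifact in both samples: 5/60 in tumor, 3/40 in normal.
    print(fisher_one_sided(37, 3, 55, 5))
```

The first site gives a small p-value, so a shared allele frequency is rejected. The second does not, because the alt reads show up in the normal at about the same rate. With independent calling and subtraction, a low-frequency artifact like the second one can be reported in the tumor, missed in the normal, and so survive as a false "somatic" call.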

ADD REPLY • link updated 4.9 years ago by Ram 44k • written 8.8 years ago by Chris Miller 22k

0

Thank you very much for the information. I really appreciate it.

ADD REPLY • link 8.8 years ago by DVA ▴ 630

0

How about flipping it around? Given the availability of tools that perform joint calling of paired tumor-normal samples and take these factors into account, why would you want to run a pipeline that does not?

Even in germline calling, there is a benefit to calling multiple samples jointly (or pooled, depending on your terminology) rather than independently (see here), and you get a further benefit by incorporating pedigree information directly into the calling, as done in RTG's pedigree-aware calling. (BTW, RTG also have a paired tumor-normal variant caller you may want to try.)

ADD REPLY • link updated 4.9 years ago by Ram 44k • written 8.8 years ago by Len Trigg ★ 1.6k

0

Like I said, it was out of curiosity :) Thank you very much for the information.

ADD REPLY • link 8.8 years ago by DVA ▴ 630