Greetings,
I have been working on a comparison of a human endocrine cancer with matched normal tissue from the same organ. The tissue was dissected by a pathologist and is considered quite pure. We sequenced to 80x depth in the tumor and 40x in the normal. Somatic variant calling was done with Varscan2 (samtools -q 1 to use only mapped reads, and the high confidence set). Indels and SNPs were filtered and annotated; we are looking at coding, non-synonymous mutations and splice site mutations. In our small cohort of 15 patients, we looked for mutations that occur in more than one tumor sample. The resulting list surprised me: about half of the variants are found in dbSNP 1.3.7 and most in 1.3.1. About a quarter are in the COSMIC database (although some of those are in dbSNP as well). I pressed on to manual verification in IGV.
The surprises continued: many of the sites called as somatic mutations are in fact germline. There are more reads in the tumor, but in the normal the variant allele frequency is often similar or identical. Many of the sites are in dbSNP. Many are in low-coverage regions.
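To make the IGV observation concrete, here is a minimal Python sketch of the check I ended up doing by eye: compare the variant allele frequency in the normal against a cutoff and flag calls that look germline or that sit in low normal coverage. The function names and the thresholds (`min_normal_depth`, `max_normal_vaf`) are hypothetical and chosen only for illustration; they are not Varscan2 defaults.

```python
def variant_allele_fraction(ref_reads, alt_reads):
    """Fraction of reads supporting the variant allele (0.0 if no coverage)."""
    depth = ref_reads + alt_reads
    return alt_reads / depth if depth else 0.0


def classify_call(normal_ref, normal_alt, tumor_ref, tumor_alt,
                  min_normal_depth=10, max_normal_vaf=0.02):
    """Label one tumor/normal call from its read counts.

    The thresholds are illustrative assumptions, not values from any caller.
    """
    normal_depth = normal_ref + normal_alt
    if normal_depth < min_normal_depth:
        # Too few normal reads to rule out a germline variant.
        return "low_normal_coverage"
    if variant_allele_fraction(normal_ref, normal_alt) > max_normal_vaf:
        # The normal carries the variant too, so it is unlikely to be somatic.
        return "germline_like"
    if tumor_alt == 0:
        return "no_tumor_support"
    return "somatic_candidate"


# A site with ~47% VAF in the normal is flagged as germline-like.
print(classify_call(normal_ref=20, normal_alt=18, tumor_ref=40, tumor_alt=42))
```

Most of the calls I looked at would fall into the first two categories under cutoffs like these.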
I feel like I did what I could to get quality somatic mutations out of Varscan. I followed the entire GATK pipeline to indel-realign, recalibrate, etc. I used only mapped reads in samtools when piping into Varscan, and from Varscan we used the high confidence set. Yet everything is such a soft call: a 'somatic' mutation in one patient has some reads in the normal, and other samples have the variant as a polymorphism as well. I tried to use the preparation and variant calling pipeline to solve this problem, yet here I am at the end, feeling like I'm back to square one. Is there something obvious that I'm missing? (I have also used Mutect, Strelka, GATK, and SomaticSniper, but the number of false positives seemed to be lowest with Varscan2.)
Thank you for the feedback, AOC
AOC,
I recently ran into a similar problem. I used MuTect to produce a somatic variant callset but noticed that many of the variants seemed to come from reads with low mapping quality. After filtering on variant read MQ, the remaining variants seemed to be of high quality. We then went on to validate only the most interesting variants and discovered that many of them were due to systematic error in the sequencing technology: the variants we thought were somatic were present at very low frequency in all samples (tumor and normal). To correct this in the future, we are filtering out variants in our somatic callset that are present in any normal sample or in dbSNP. We are also annotating variants in high-GC regions, as the false positives seem to come consistently from those regions.
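For the GC annotation, a rough sketch of the idea in Python is below. It assumes the reference sequence for the contig is already loaded as a plain string and that positions are 0-based; the window size (`flank`) and the `threshold` for calling a region "high GC" are hypothetical values for illustration, not the ones from our pipeline.

```python
def gc_fraction(sequence):
    """Fraction of called bases (A/C/G/T) that are G or C."""
    called = [base for base in sequence.upper() if base in "ACGT"]
    if not called:
        return 0.0
    return sum(base in "GC" for base in called) / len(called)


def annotate_high_gc(reference, position, flank=50, threshold=0.65):
    """Annotate a 0-based position with the GC content of its surrounding window.

    `flank` and `threshold` are illustrative assumptions.
    """
    start = max(0, position - flank)
    end = min(len(reference), position + flank + 1)
    gc = gc_fraction(reference[start:end])
    return {"gc": gc, "high_gc": gc >= threshold}


# Toy reference: an AT-rich stretch followed by a GC-rich stretch.
toy_reference = "AT" * 10 + "GC" * 30
print(annotate_high_gc(toy_reference, position=50, flank=5))
```

The annotation only flags variants; whether to drop flagged calls is a separate decision.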
I would guess that many of your variants are likely due to systematic error as well, and the increased frequency of those variants in the tumor sample is due to chance rather than anything of biological significance. Probably the best and easiest solution is to filter out any called somatic variants present in any normal samples, although this may be a little drastic depending on what you are looking for.
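As a sketch of that filter, assuming each variant is represented as a `(chrom, pos, ref, alt)` tuple and that you have already collected the calls from every normal sample plus a set of dbSNP sites in the same form, something like the following Python would do it. The function name and data layout are hypothetical, not part of MuTect or Varscan.

```python
def filter_somatic_calls(somatic_calls, normal_calls_by_sample,
                         known_sites=frozenset()):
    """Split somatic calls into kept and removed lists.

    A call is removed if the same (chrom, pos, ref, alt) tuple appears in
    any normal sample or in `known_sites` (for example, sites from dbSNP).
    """
    seen_in_normals = set()
    for calls in normal_calls_by_sample.values():
        seen_in_normals.update(calls)

    kept, removed = [], []
    for variant in somatic_calls:
        if variant in seen_in_normals or variant in known_sites:
            removed.append(variant)
        else:
            kept.append(variant)
    return kept, removed


# Toy data: one call recurs in a normal, one is a known site, one survives.
somatic = [("chr1", 100, "A", "T"), ("chr2", 200, "G", "C"), ("chr3", 300, "C", "A")]
normals = {"patient01_normal": {("chr1", 100, "A", "T")}, "patient02_normal": set()}
dbsnp_sites = {("chr2", 200, "G", "C")}

kept, removed = filter_somatic_calls(somatic, normals, dbsnp_sites)
print(kept)
```

Returning the removed calls as well makes it easy to check whether anything you care about (a COSMIC site, for instance) was thrown out by the filter.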