Question

Discrimination Between Germline And Somatic Mutations In Tumor Without The Availability Of The Normal Paired Sample

31

11.7 years ago

Fred Fleche 4.3k

Hi,

Let's say I receive a whole-exome sequencing data file that was generated without a matched normal sample for the sequenced tumor. Is there a way to distinguish between germline and somatic variants? I was thinking of comparing the variants against the COSMIC (Catalogue Of Somatic Mutations In Cancer) database.

So I was wondering whether anyone can suggest a reasonably accurate workflow that uses sources other than COSMIC.

Thanks,

Fred

mutation somatic • 21k views

updated 11.7 years ago by Stefano Berri 4.4k • written 11.7 years ago by Fred Fleche 4.3k

score 45 · Answer 1 · 2013-02-27

Here is what I do:

1. Flag known germline variants by looking in dbSNP. I use a subset of dbSNP (> 1% minor allele frequency, mapping only once to the reference assembly, and not flagged as "clinically associated"). You can get such a file for ANNOVAR (the database name is snp137NonFlagged for the current dbSNP build), see http://www.openbioinformatics.org/annovar/annovar_download.html
2. Flag known somatic variants by looking in COSMIC. This usually finds well-described hotspot mutations (such as activating KRAS mutations), but my guess is that it will not find most of your true somatic variants. I usually take the whole of COSMIC, irrespective of tumor type.
3. Add other cancer sequencing studies (e.g. TCGA), as many of these are not yet in COSMIC. For TCGA, I use the MAF files available at https://tcga-data.nci.nih.gov/tcgafiles/ftp_auth/distro_ftpusers/anonymous/tumor/. Level 3 MAF files contain experimentally validated somatic mutations only. Level 2 MAF files also contain the unvalidated ones (and can contain germline variants).
4. Look at the variant allele frequency. If it's 100%, i.e. all reads show the variant, it's very likely germline (unless your tumor sample is 100% tumor cells and all tumor cells have the mutation). If it's below 10%, it may well be an artifact, see e.g. http://www.ncbi.nlm.nih.gov/pubmed/23303777
5. Check how all of the mismatches in your data (non-reference bases in the alignment) are distributed along the reads from 5' to 3'. If you have a much higher mismatch rate at the first/last bases of your reads, you might want to exclude these read positions.
6. Filter your variant list further, as it will likely contain a considerable number of false positives. Table 1 of the VarScan paper http://www.ncbi.nlm.nih.gov/pubmed/22300766 is a good start (read position, strand, variant read number and frequency, distance to 3', homopolymer, mapping quality and read length difference).
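
To make the flagging steps concrete, here is a minimal Python sketch of how the database lookups and the allele-frequency check could be combined. It is not the code I actually run: the `Variant` class, the `flag_variant` function, the flag names and the toy coordinates are all made up for illustration, and it assumes you have already loaded the dbSNP subset and the COSMIC/TCGA variants into sets of (chrom, pos, ref, alt) tuples. The 10% and 95% cut-offs are assumptions standing in for "below 10%" and "100%" above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    """One called variant with its read counts (hypothetical container)."""
    chrom: str
    pos: int
    ref: str
    alt: str
    ref_reads: int
    alt_reads: int

    @property
    def key(self):
        # Lookup key shared with the annotation sets.
        return (self.chrom, self.pos, self.ref, self.alt)

    @property
    def vaf(self):
        # Variant allele frequency = variant reads / total reads.
        depth = self.ref_reads + self.alt_reads
        return self.alt_reads / depth if depth else 0.0


def flag_variant(v, common_snps, known_somatic, low_vaf=0.10, high_vaf=0.95):
    """Return a list of flags; flags are hints, not a final classification."""
    flags = []
    if v.key in common_snps:        # e.g. loaded from the dbSNP subset
        flags.append("known_germline")
    if v.key in known_somatic:      # e.g. loaded from COSMIC and TCGA MAF files
        flags.append("known_somatic")
    if v.vaf >= high_vaf:           # (nearly) all reads carry the variant
        flags.append("vaf_near_100_likely_germline")
    elif v.vaf < low_vaf:           # very low frequency: may be an artifact
        flags.append("vaf_below_10_possible_artifact")
    return flags


if __name__ == "__main__":
    # Toy annotation sets with invented coordinates.
    common_snps = {("chr1", 100, "A", "G")}
    known_somatic = {("chr2", 200, "C", "T")}
    calls = [
        Variant("chr1", 100, "A", "G", ref_reads=0, alt_reads=40),
        Variant("chr2", 200, "C", "T", ref_reads=70, alt_reads=30),
        Variant("chr3", 300, "G", "A", ref_reads=95, alt_reads=5),
    ]
    for v in calls:
        print(v.key, round(v.vaf, 2), flag_variant(v, common_snps, known_somatic))
```

The function returns flags rather than a single verdict because a variant can hit several criteria at once, and the remaining filters (read position, strand, mapping quality and so on) still have to be applied afterwards.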

score 13 · Answer 2 · 2013-02-27

Looking at already known cancer mutations is fine, but it can only tell you about what is already known.

Personally, I would look at the allele frequency of each mutation. If it is germline, it is at either 100% or 50% (clearly not exactly 50%, but around there).

If it is a somatic mutation and your samples are clinical samples (not cell lines), then infiltration with normal cells is inevitable and your mutations will be at 30-40%.

If coverage is high enough, you might be able to distinguish confidently between the two.
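
As a rough illustration of why coverage matters, the sketch below compares two binomial models for the observed read counts: a germline heterozygous variant (expected frequency 50%) and a somatic one. It rests on simplifying assumptions that are not stated above: the somatic mutation is heterozygous, present in every tumor cell, and lies in a diploid region with no copy-number change, so its expected frequency is half the tumor cell fraction (30-40% would then correspond to 60-80% tumor cells). The function names are hypothetical, and sequencing errors and overdispersion are ignored.

```python
import math


def expected_somatic_vaf(purity):
    """Expected VAF of a clonal heterozygous somatic mutation in a
    copy-neutral diploid region, given the tumor cell fraction."""
    if not 0.0 < purity <= 1.0:
        raise ValueError("purity must be in (0, 1]")
    return purity / 2.0


def binom_log_pmf(k, n, p):
    """Log of the binomial probability of k variant reads out of n."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log(1.0 - p))


def log_odds_somatic_vs_germline_het(alt_reads, depth, purity):
    """Positive values favor the somatic model, negative values favor a
    germline heterozygous variant (natural-log likelihood ratio)."""
    p_somatic = expected_somatic_vaf(purity)
    return (binom_log_pmf(alt_reads, depth, p_somatic)
            - binom_log_pmf(alt_reads, depth, 0.5))


if __name__ == "__main__":
    purity = 0.7  # assumed tumor cell fraction
    for alt_reads, depth in [(4, 10), (35, 100), (50, 100)]:
        score = log_odds_somatic_vs_germline_het(alt_reads, depth, purity)
        print(f"{alt_reads}/{depth} variant reads: log odds = {score:.2f}")
```

With only 10 reads, the two models are barely separable even when the observed frequency is 40%; at a depth of 100 the same kind of evidence becomes much more decisive.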

To better understand what I mean, I suggest this great paper