Say we have called a list of variants from a tumor sample (no matched normal control available), and we want to set filtering criteria that separate somatic mutations from germline variants as well as from false-positive calls. We have set the criteria listed below. For each variant called:
1) ignore positions with depth < 50
2) ignore variants with call quality < 30
3) ignore variants with allele frequency < 0.02
4) ignore variants with allele frequency equal to 0.5 or > 0.9
5) ignore variants that appear in dbSNP
6) ignore variants that are not in COSMIC
7) check neighboring variants and see whether they have similar allele frequencies. If yes, this region may be polyploid and all variants within it may actually be germline. For example, allele frequencies of 4 neighboring variants in a tetraploid region: 0.23; 0.21; 0.19; 0.24
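Criteria 1 to 6 above could be sketched roughly as follows. This is only an illustration: the `Variant` record, its field names, and the `het_tol` parameter are assumptions made for the example, and the dbSNP/COSMIC membership flags are assumed to have been annotated beforehand.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    # Hypothetical record; field names are assumptions for this sketch.
    chrom: str
    pos: int
    depth: int        # read depth at the position
    qual: float       # variant call quality
    af: float         # variant allele frequency in the tumor
    in_dbsnp: bool    # pre-annotated: variant is present in dbSNP
    in_cosmic: bool   # pre-annotated: variant is present in COSMIC


def passes_hard_filters(v, min_depth=50, min_qual=30.0, min_af=0.02,
                        het_af=0.5, het_tol=0.0, max_af=0.9):
    """Return True if the variant survives criteria 1 to 6."""
    if v.depth < min_depth:                 # criterion 1
        return False
    if v.qual < min_qual:                   # criterion 2
        return False
    if v.af < min_af:                       # criterion 3
        return False
    # Criterion 4: het_tol=0.0 means "exactly 0.5"; a small tolerance
    # (an assumption, not part of the list) would catch values near 0.5.
    if abs(v.af - het_af) <= het_tol or v.af > max_af:
        return False
    if v.in_dbsnp:                          # criterion 5
        return False
    if not v.in_cosmic:                     # criterion 6
        return False
    return True
```

Keeping each threshold as a named parameter makes it easy to report exactly which cut-offs were applied.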
Can anyone comment on whether any of these criteria sound unreasonable? More importantly, can you think of any further criteria that may help to identify real somatic mutations? I would really appreciate it.
Can anyone share some opinions on this topic? I would really appreciate it!
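Criterion 7 in the list above could be approximated by scanning position-sorted variants for runs of neighbors with similar allele frequencies. This is a minimal sketch: the function name and the `max_gap`, `af_tol`, and `min_run` thresholds are arbitrary assumptions chosen for illustration, not established values.

```python
def similar_af_runs(variants, max_gap=10_000, af_tol=0.06, min_run=3):
    """Find runs of neighboring variants with similar allele frequencies.

    variants: list of (pos, af) tuples from one chromosome, sorted by pos.
    A run grows while the next variant lies within max_gap bp of the
    previous one and the spread (max AF - min AF) stays within af_tol.
    Only runs with at least min_run variants are returned.
    """
    runs, current = [], []
    for pos, af in variants:
        if current:
            afs = [a for _, a in current] + [af]
            close = pos - current[-1][0] <= max_gap
            similar = max(afs) - min(afs) <= af_tol
            if not (close and similar):
                # Close the current run and start a new one.
                if len(current) >= min_run:
                    runs.append(current)
                current = []
        current.append((pos, af))
    if len(current) >= min_run:
        runs.append(current)
    return runs


# The four allele frequencies from the example, plus one distant variant.
calls = [(100, 0.23), (200, 0.21), (300, 0.19), (400, 0.24), (50_000, 0.45)]
suspect_runs = similar_af_runs(calls)
```

Here the first four variants form a single run that would be flagged for review, while the distant variant stands alone.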
I strongly recommend against filtering with dbSNP. dbSNP contains somatic variants that have functional consequences and can be disease-driving. It is better to filter against the 1000 Genomes Project, as these variants reflect common human variation. This can be done with the Variant Effect Predictor (VEP) from Ensembl.
Agreed. Thanks for sharing.
By using VEP to perform that analysis, what would be the flag that would point to a potential germline variant? Thanks!
You mean germline 'variant'? You could check the 1000 Genomes minor allele frequencies to infer whether or not the variant in question has been observed in this cohort, and thus, infer whether or not it is a germline variant - nothing can ever 100% guarantee that it is germline, though. You will require the matched normal sample for that. Also remember that some germline variants can increase risk of cancer.
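The population-frequency check described above might look something like this. It is a sketch under stated assumptions: `population_maf` is a hypothetical lookup table (for example, built from annotated 1000 Genomes frequencies), the variant key format and the 1% cut-off are assumptions, and the labels are only hints, since a matched normal is needed for certainty.

```python
def label_by_population_maf(variant_keys, population_maf, maf_cutoff=0.01):
    """Label variants by their minor allele frequency in a population cohort.

    variant_keys: iterable of (chrom, pos, ref, alt) tuples.
    population_maf: dict mapping the same keys to a MAF (hypothetical table).
    """
    labels = {}
    for key in variant_keys:
        maf = population_maf.get(key)
        if maf is None:
            labels[key] = "not_observed"        # absent from the cohort
        elif maf >= maf_cutoff:
            labels[key] = "likely_germline"     # common in the population
        else:
            labels[key] = "rare_in_population"  # observed, but rare
    return labels


maf_table = {("chr1", 1000, "A", "G"): 0.12, ("chr2", 500, "C", "T"): 0.0004}
calls = [("chr1", 1000, "A", "G"), ("chr2", 500, "C", "T"), ("chr3", 42, "G", "A")]
labels = label_by_population_maf(calls, maf_table)
```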
Thanks, @Kevin Blighe! Always very helpful. Yes, I meant "variant". Sorry about that. I corrected it now. (: And yes, I totally agree that this should be only a rough estimation. In the case you're describing here, would enabling the option "Exclude common variants" on the VEP tool achieve that?
To a certain extent it would achieve that - yes. The 'Exclude common variants' filter is explained like this:
[source: https://uswest.ensembl.org/info/docs/tools/vep/online/input.html]
So, it filters based on a MAF cut-off.
Provided that you document everything that you do, the specifics of the filtering steps can be debated at a later time, possibly by peer reviewers.
Sorry about this late reply. Somehow some replies were not showing on my feed. Yes, I had read the explanation for that before, but I still wasn't sure whether using this alone would be enough to estimate the somatic variants. After running VEP, I saw that a lot of variants were actually filtered out, and from skimming through them, the filtering made sense. Thank you again, @Kevin Blighe! Honestly, if it weren't for your input in a lot of threads here, it would take me 10 times longer to find the answers. Thank you! And thanks to the many members of this awesome community.
Also, I considered using Mutect2 in tumor-only mode, but it seems that there is still a need to build a panel of normals (PoN). From what I understood, a way to call somatic variants in tumor-only samples would be to first create a PoN from all the tumors, identify the overlapping variants, and then exclude them from each sample. Is that correct?
I am not too familiar with Mutect2, as I use Lancet for somatic variant calling. Would you not have to create the PoN from actual normal samples, though? I guess that the logic about common variants in the tumours is that these would not normally occur by chance in the context of somatic mutation, and could therefore be assumed to be germline variants.
My impression is that the PoN in Mutect2 is primarily used to exclude artifacts that occur frequently across your cohort.
Thank you
Thank you, @CY. And as @Kevin Blighe pointed out above, do you think that eliminating the common variants found in my samples (cancer only) could be used as a step to filter out potential germline variants? Would it improve the performance if used jointly with the VEP analysis, or would most of it likely overlap with what VEP finds?
Yes, I think getting rid of non-reference sites that recurrently appear in the PoN would, in practice, filter out both artifacts and germline variants.
Thank you, @CY. It does make a lot of sense.
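The recurrence idea discussed here can be illustrated with a short sketch. Note that this is not Mutect2's PoN workflow, just a plain illustration of removing sites seen in several samples; the function names, the variant key format, and the `min_samples` threshold are assumptions.

```python
from collections import Counter


def recurrent_sites(sample_calls, min_samples=2):
    """Return the variants that appear in at least min_samples samples.

    sample_calls: dict mapping sample name -> list of variant keys,
    e.g. (chrom, pos, ref, alt) tuples.
    """
    counts = Counter()
    for calls in sample_calls.values():
        counts.update(set(calls))  # count each variant once per sample
    return {site for site, n in counts.items() if n >= min_samples}


def remove_recurrent(sample_calls, min_samples=2):
    """Drop recurrent variants from every sample's call list."""
    blacklist = recurrent_sites(sample_calls, min_samples)
    return {sample: [c for c in calls if c not in blacklist]
            for sample, calls in sample_calls.items()}


cohort = {
    "tumor_A": [("chr1", 100, "A", "G"), ("chr2", 200, "C", "T")],
    "tumor_B": [("chr1", 100, "A", "G"), ("chr3", 300, "G", "A")],
    "tumor_C": [("chr1", 100, "A", "G")],
}
filtered = remove_recurrent(cohort)
```

One caveat with a tumor-only cohort: a recurrent somatic hotspot mutation shared by several tumors would be removed along with the artifacts and germline variants, so a whitelist of known hotspots may be worth considering.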