Question

Good annotation databases for a VCF file (somatic variants)

3

3.5 years ago

Zahra ▴ 110

Hi all,

I have a set of genes and want to investigate their mutation status in cancer, so I annotated my VCF file (somatic variants) with ANNOVAR using a number of databases (icgc28, nci60, Noncoding_CosmicV92, Coding_CosmicV92, cadd13gt10, dann, clinvar_20210501, avsnp150, gnomad211_exome, gnomad211_genome, hrcr1, cg46, cg69, kaviar_20150923, refGeneWithVer, knownGene, ensGene, cytoBand, genomicSuperDups, tfbsConsSites, wgRna, gwasCatalog, abraom, dbnsfp41a, eigen, esp6500siv2_all, exac03, gme, intervar_20180118), but I'm not sure whether I need all of them?! :|

Which ones are helpful and preferred for annotating a VCF file of somatic variants?

Thanks for any help or suggestion.

VCF ANNOVAR somatic databases annotation • 5.4k views

ADD COMMENT • link updated 3.4 years ago by emma.a ▴ 130 • written 3.5 years ago by Zahra ▴ 110

1

If you are interested in driver mutations that occur in protein-coding regions, then cadd/dann/eigen have suboptimal performance, as they were not made for somatic mutations in cancer or optimized for protein-coding alterations. Combining multiple predictors won't alleviate this problem (I've tried). You would be much better off using variant predictors designed for somatic mutations in cancer, or ones that have at least been benchmarked to perform well on them (see https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-01954-z ). That benchmark would suggest CHASM, CTAT-cancer, DEOGEN2, or PrimateAI. Since I'm the developer of CHASMplus, a considerably better version of the CHASM that was included in the benchmark, I would suggest CHASMplus, but you would need to either run the VCF file on the OpenCRAVAT webserver or use a downloadable command-line tool (see https://open-cravat.readthedocs.io/en/latest/ ).

ADD REPLY • link 3.5 years ago by Collin ▴ 1000

score 3 · Answer 1 · 2021-06-14

3

3.5 years ago

Cyriac Kandoth 6.1k

To identify cancer-driving somatic mutations with high specificity, always start with manually curated knowledgebases: ClinVar, OncoKB, and CIViC. These links show you how they manually classify variants based on literature, and if/how they are useful to your work. If your work needs high sensitivity for cancer drivers and can tolerate lower specificity (e.g. basic research), then try the pathogenicity prediction algorithms that Collin mentioned above: CHASM, CTAT-cancer, DEOGEN2, or PrimateAI.

Instead of ANNOVAR, I recommend vcf2maf, which annotates a VCF using Ensembl's VEP and produces a tab-delimited, Excel-friendly output file. These are some columns that are most useful for distinguishing driver/passenger somatic mutations. Having your mutations in MAF format also enables the use of maftools for fancy plots and the oncokb-annotator to bulk annotate variants. Also look at CIViCpy, which will need some scripting to bulk annotate all your variants.
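As a minimal sketch of what a first pass over a MAF can look like, the snippet below keeps only protein-altering mutations in a set of genes of interest. It assumes the standard MAF column names `Hugo_Symbol` and `Variant_Classification` and a leading `#version` comment line; the function name `select_candidates`, the choice of classifications in `PROTEIN_ALTERING`, and the example rows are illustrative assumptions, not part of vcf2maf.

```python
import csv
import io

# Assumption: these MAF Variant_Classification values are the ones treated
# as "protein-altering" for this illustration; adjust to your own needs.
PROTEIN_ALTERING = {
    "Missense_Mutation", "Nonsense_Mutation", "Nonstop_Mutation",
    "Frame_Shift_Del", "Frame_Shift_Ins", "In_Frame_Del", "In_Frame_Ins",
    "Splice_Site", "Translation_Start_Site",
}


def select_candidates(maf_handle, genes):
    """Return MAF rows (as dicts) in `genes` with a protein-altering class."""
    # Skip comment lines such as "#version 2.4" before the header row.
    data_lines = (line for line in maf_handle if not line.startswith("#"))
    reader = csv.DictReader(data_lines, delimiter="\t")
    return [
        row for row in reader
        if row["Hugo_Symbol"] in genes
        and row["Variant_Classification"] in PROTEIN_ALTERING
    ]


if __name__ == "__main__":
    # Tiny made-up MAF fragment, only to show the call.
    example = io.StringIO(
        "#version 2.4\n"
        "Hugo_Symbol\tVariant_Classification\n"
        "TP53\tMissense_Mutation\n"
        "TP53\tSilent\n"
        "KRAS\tMissense_Mutation\n"
    )
    for row in select_candidates(example, {"TP53"}):
        print(row["Hugo_Symbol"], row["Variant_Classification"])
```

A filter like this only narrows the list; whether a remaining variant is a driver is still a question for the curated knowledgebases or predictors mentioned above.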

ADD COMMENT • link 3.5 years ago by Cyriac Kandoth 6.1k

1

Good points, Cyriac. I agree that computational predictors are likely to have more false positives, and manually curated DBs should be the first option. But I do think it's important to also be aware of the limitations of manually curated databases. First, these databases are highly incomplete, so a mutation not being in the database does not imply that it is a passenger (i.e. low sensitivity). Second, the databases often don't clarify whether the underlying evidence is strictly based on functional impact on a protein, or is direct evidence for altering tumor growth (i.e. oncogenicity). This could lead to a couple of problems: 1) a "gain-of-function" impact assessed in a non-cancer context is assumed to be "likely oncogenic"; 2) a particular altered function is used which has not previously been established as a valid surrogate for oncogenicity in that gene. Third, the databases sometimes try to annotate "loss-of-function" or "gain-of-function" based on rules of thumb, such as frameshift indels must be "loss-of-function". However, as we point out in a recent paper, such rules of thumb are not necessarily right all of the time (https://pubmed.ncbi.nlm.nih.gov/33567269/ ). For example, GATA3 frameshift indels increase GATA3 protein expression (a "gain-of-function" effect) through loss of a degron. So it is not clear whether OncoKB, for example, should have labeled GATA3 as a tumor suppressor with truncating mutations as "loss-of-function".

ADD REPLY • link 3.5 years ago by Collin ▴ 1000

1

All good points, Collin. You are describing the nuance at the frontier of this domain that the OP will soon learn about. :) I updated my answer to emphasize the sensitivity/specificity tradeoffs.

ADD REPLY • link 3.5 years ago by Cyriac Kandoth 6.1k

score 2 · Answer 2 · 2021-05-29

2

3.5 years ago

emma.a ▴ 130

With ANNOVAR I personally use: refGene, ensGene, gnomad211_exome, gnomad211_genome, gnomad30_genome, icgc28, cosmic92_coding, cosmic92_noncoding, dbnsfp41a

You can add TCGA info, and there are also other somatic mutation databases that you can convert into ANNOVAR custom databases ...

As always, it depends on what you are looking for in the end, after annotation ...

Best

ADD COMMENT • link 3.5 years ago by emma.a ▴ 130

0

Thanks for your kind reply. As you suggested, I downloaded the databases I'm interested in (e.g. DoCM, CIViC) and tried to convert them to ANNOVAR custom databases, but I couldn't get the conversion to work. Would you mind pointing me to a helpful link, script, or paper for this conversion? I couldn't find a practical description in the ANNOVAR documentation.

Thanks in advance

ADD REPLY • link 3.5 years ago by Zahra ▴ 110

1

A custom ANNOVAR database needs the following columns:

"chr" "start" "end" "ref" "alt" "info1" "info2" "info3" ... etc.

Just open a ready-to-use ANNOVAR database and check how it is laid out.

At the end you have to index your custom database.
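The column layout above can be produced from a VCF with a short script. The following is a minimal sketch, not an official ANNOVAR tool: the function names `to_annovar_allele`, `vcf_to_annovar_rows`, and `write_table`, the INFO keys, and the demo records are all hypothetical. It assumes ANNOVAR's usual allele convention (shared leading bases trimmed, `-` for the empty allele in insertions and deletions) and a `#Chr Start End Ref Alt ...` header line like the one in the downloadable databases; compare the output with a ready-made database or with `convert2annovar.pl` before relying on it.

```python
import csv
import io


def to_annovar_allele(pos, ref, alt):
    """Convert a VCF (pos, ref, alt) to ANNOVAR-style (start, end, ref, alt)."""
    # Trim bases shared at the start of ref and alt (the VCF anchor base).
    n = 0
    while n < len(ref) and n < len(alt) and ref[n] == alt[n]:
        n += 1
    ref_t, alt_t = ref[n:], alt[n:]
    if not ref_t:
        # Pure insertion: start = end = last shared base, ref is "-".
        start = end = pos + n - 1
        return start, end, "-", alt_t
    # SNV, deletion, or block substitution: span the remaining ref bases.
    start = pos + n
    end = start + len(ref_t) - 1
    return start, end, ref_t, alt_t or "-"


def vcf_to_annovar_rows(vcf_lines, info_keys):
    """Yield [chr, start, end, ref, alt, info1, info2, ...] for each ALT."""
    for line in vcf_lines:
        if line.startswith("#") or not line.strip():
            continue
        chrom, pos, _id, ref, alts, _qual, _filt, info = (
            line.rstrip("\n").split("\t")[:8]
        )
        # Flag-style INFO entries (no "=") are stored with the value "1".
        info_map = dict(
            item.split("=", 1) if "=" in item else (item, "1")
            for item in info.split(";")
        )
        for alt in alts.split(","):
            if alt.startswith("<") or alt == "*":
                continue  # symbolic alleles have no simple ANNOVAR form
            start, end, r, a = to_annovar_allele(int(pos), ref, alt)
            # Note: per-allele INFO values are copied unchanged to every ALT.
            yield [chrom, start, end, r, a] + [
                info_map.get(key, ".") for key in info_keys
            ]


def write_table(vcf_lines, info_keys, out_handle):
    """Write a tab-delimited table with an assumed '#Chr ...' header line."""
    writer = csv.writer(out_handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["#Chr", "Start", "End", "Ref", "Alt"] + list(info_keys))
    writer.writerows(vcf_to_annovar_rows(vcf_lines, info_keys))


if __name__ == "__main__":
    # Made-up records with hypothetical INFO keys, only to show the call.
    demo_vcf = [
        "##fileformat=VCFv4.2\n",
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
        "1\t100\t.\tA\tG\t.\t.\tGENE=ABC;CNT=5\n",
        "1\t200\t.\tAT\tA\t.\t.\tGENE=ABC;CNT=2\n",
    ]
    out = io.StringIO()
    write_table(demo_vcf, ["GENE", "CNT"], out)
    print(out.getvalue(), end="")
```

Sorting the resulting table by chromosome and start position before indexing is a sensible precaution, since the bundled databases are sorted.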

ADD REPLY • link 3.4 years ago by emma.a ▴ 130

0

Dear Emmanouil, I cannot find the links to download icgc28, cosmic92_coding, and cosmic92_noncoding. Can you tell me where I can download them? Thank you so much.

ADD REPLY • link 3.4 years ago by dophuochuy94 • 0

0

Hi!

On the ANNOVAR site you can find only a very old version of COSMIC. For a newer one you have to build it yourself: download the VCF files from the COSMIC site and convert them into ANNOVAR tables/databases.

Here is the ANNOVAR download page. The ICGC database is listed there as "hg38 - icgc28 - International Cancer Genome Consortium - version 28 - 20210122"

ADD REPLY • link 3.4 years ago by emma.a ▴ 130