I am trying to annotate both coding and non-coding variants with information from the COSMIC database using ANNOVAR. ANNOVAR doesn't provide direct support for the latest release of COSMIC because of licensing issues. Instead, users are directed to build their own ANNOVAR database for COSMIC following these guidelines: http://annovar.openbioinformatics.org/en/latest/user-guide/filter/#cosmic-annotations
I am able to build the database for coding variants using the guidelines, but not the one for non-coding variants. The wording in the manual seems to suggest that this is not possible for non-coding variants:
COSMIC changed their data formats so non-coding mutations are no longer in the MutantExport file, so we can no longer calculate their occurrences in various tumors. COSMIC now provides a CosmicNCV.tsv file, but it is not really that informative as the cancer tissue information is missing from this file.
Is there a way to annotate non-coding variants from COSMIC using ANNOVAR?
My failed attempt:
~/utils/annovar/prepare_annovar_user.pl --buildver hg19 -dbtype cosmic <(zcat CosmicNCV.tsv.gz) -vcf <(zcat CosmicNonCodingVariants.vcf.gz) > hg19_cosmicNonCoding80.txt 2> hg19_cosmicNonCoding80.log
Error: COSMIC MutantExport format error: column 17 should be 'Mutation ID'
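Since the error complains about the header at column 17, one way to investigate is to list the header columns of the input file with their 1-based positions and compare them with what the script expects. The following is only a minimal Python sketch: the helper names `header_columns` and `read_header` are hypothetical, and it assumes the file is gzip-compressed, tab-separated, and has its header on the first line.

```python
import gzip


def header_columns(header_line):
    """Split a tab-separated header line into (1-based position, name) pairs."""
    names = header_line.rstrip("\r\n").split("\t")
    return list(enumerate(names, start=1))


def read_header(path):
    """Read the first line of a gzip-compressed TSV and return its columns.

    Assumption: the header is the very first line of the file.
    """
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return header_columns(handle.readline())


# Example (file name taken from the command above):
# for position, name in read_header("CosmicNCV.tsv.gz"):
#     print(position, name, sep="\t")
```

Checking which column name actually appears at position 17 shows how the layout of this file differs from the MutantExport layout the script is checking for.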
EDIT: Cross-posted on the ANNOVAR discussion board. I will update if there is any lead. http://annovar.openbioinformatics.org/en/latest/user-guide/filter
Hi, just a note: I used ANNOVAR for a long time and then switched to VEP and snpEff, because I found some exonic variants that ANNOVAR had annotated as intronic. I think the problem comes from the database used by ANNOVAR, but I never succeeded in updating it...
This might be because ANNOVAR used a different transcript. Deciding which transcript to use is extremely tricky, and none of the annotators completely solves this problem. Some use the canonical transcript, others the longest one, and others something else again :-(
If you go with VEP/snpEff + GEMINI you get a somewhat better approach to the transcript issue, although again not a perfect one. snpEff annotates against all transcripts and GEMINI stores those annotations, then basically picks the transcript with the highest predicted impact. That ends up giving you some false-positive hits for predicted high-impact variants in transcripts that aren't well supported, but at least you don't miss things.
Based on what?
Most of the time it is based on the longest transcript for poorly annotated genes, if I remember correctly.
snpEff translates the nucleotide variant into its protein-level impact (Sequence Ontology terms). GEMINI categorizes those into HIGH, MED, and LOW categories. If more than one transcript has an impact in the same category, I believe it picks the longest transcript. The HIGH, MED, LOW split is basically just binning what you would expect: stop gain, frameshift, and splice donor/acceptor are HIGH; MED is mostly missense mutations; LOW is synonymous, intronic, etc.
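To make the selection rule described above concrete, here is a rough Python sketch: bin each transcript's Sequence Ontology term into HIGH/MED/LOW, pick the highest bin, and break ties by transcript length. This is not GEMINI's actual code. The `SEVERITY` table only covers the examples mentioned here, `TranscriptEffect` and `pick_transcript` are hypothetical names, and treating unlisted terms as LOW is an assumption made for the sketch.

```python
from dataclasses import dataclass

# Illustrative bins only, covering the terms named in the comment above.
SEVERITY = {
    "stop_gained": "HIGH",
    "frameshift_variant": "HIGH",
    "splice_donor_variant": "HIGH",
    "splice_acceptor_variant": "HIGH",
    "missense_variant": "MED",
    "synonymous_variant": "LOW",
    "intron_variant": "LOW",
}
RANK = {"HIGH": 3, "MED": 2, "LOW": 1}


@dataclass
class TranscriptEffect:
    transcript: str  # transcript identifier
    so_term: str     # Sequence Ontology term predicted for this transcript
    length: int      # transcript length, used only to break ties


def pick_transcript(effects):
    """Return the effect with the highest severity bin; ties go to the longest transcript."""
    # Assumption: terms missing from SEVERITY are treated as LOW.
    return max(
        effects,
        key=lambda e: (RANK[SEVERITY.get(e.so_term, "LOW")], e.length),
    )


effects = [
    TranscriptEffect("tx_long", "intron_variant", 5000),
    TranscriptEffect("tx_short", "stop_gained", 1200),
    TranscriptEffect("tx_mid", "missense_variant", 3000),
]
print(pick_transcript(effects).transcript)  # tx_short
```

The example also shows the false-positive risk mentioned earlier: the short transcript wins purely because its predicted impact is HIGH, regardless of how well supported that transcript is.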
Actually, snpEff does this categorization by itself (known as "Variant impact")!
GEMINI re-categorizes based on its own criteria from the Sequence Ontology Terms.
Thanks, I'll have a look.