Hi all!
I annotated a VCF file using VEP and noticed that it reports several variant IDs for each input variant. For example, this is an excerpt of one of the variant lines (I removed the annotation info that is not relevant to the question):
12 25398284 . C A . . rs121913529&COSV55497369&COSV55497419&COSV55497479
As you can see, three different COSMIC IDs are reported for this variant, although only one of them (COSV55497419) corresponds to the alternate allele that is found (C>A). The other IDs refer to other alternate alleles that can also be found at that position.
After reading VEP's documentation I know this is the expected behavior, but I am confused by the following lines. I am not sure I understand what they refer to as "variants with unknown alleles":
For some data sources (COSMIC, HGMD), Ensembl is not licensed to redistribute allele-specific data, so VEP will report the existence of co-located variants with unknown alleles without carrying out allele matching. To disable this behaviour and exclude these variants, use the --exclude_null_alleles flag.
Just in case, I repeated the annotation using the --exclude_null_alleles flag, but now no COSMIC IDs are reported at all; only the dbSNP ID remains.
So basically I would like to have only the specific COSMIC ID for my variant. Does anyone know how I can run the annotation with VEP so that it only reports the COSMIC ID of the alternate allele that is present in my VCF?
Thanks a lot for reading!!!
It is pretty trivial to do the COSMIC annotation without tools such as VEP: download the COSMIC coding and non-coding VCFs, normalize them (decompose/left-align) and merge them to create one annotation VCF. Use this with bcftools annotate to get the annotations you need.
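The normalize-and-merge steps described above can be sketched as a small Python driver that builds the bcftools command lines. This is only a sketch under assumptions: the COSMIC and reference file names are hypothetical, and using `bcftools norm -m -any -f` for decomposition/left-alignment and `bcftools concat -a` for the merge is one plausible choice rather than necessarily the exact commands meant in the answer. By default it only prints the commands (dry run).

```python
import shlex
import subprocess

# Hypothetical file names: adjust to the COSMIC release and reference genome you use.
REFERENCE_FASTA = "GRCh37.fa"
COSMIC_VCFS = ["CosmicCodingMuts.vcf.gz", "CosmicNonCodingVariants.vcf.gz"]


def normalize_cmd(vcf, fasta, out):
    """bcftools norm: split multiallelic records (-m -any), left-align indels (-f)."""
    return ["bcftools", "norm", "-m", "-any", "-f", fasta, "-Oz", "-o", out, vcf]


def index_cmd(vcf):
    """Create (or overwrite) a tabix index for a bgzipped VCF."""
    return ["bcftools", "index", "-t", "-f", vcf]


def merge_cmd(vcfs, out):
    """bcftools concat -a: combine indexed VCFs whose regions overlap."""
    return ["bcftools", "concat", "-a", "-Oz", "-o", out, *vcfs]


def build_pipeline(cosmic_vcfs=COSMIC_VCFS, fasta=REFERENCE_FASTA,
                   out="cosmic.norm.vcf.gz"):
    """Return the list of commands: normalize and index each file, then merge."""
    steps, normalized = [], []
    for vcf in cosmic_vcfs:
        norm_out = vcf.replace(".vcf.gz", ".norm.vcf.gz")
        steps.append(normalize_cmd(vcf, fasta, norm_out))
        steps.append(index_cmd(norm_out))
        normalized.append(norm_out)
    steps.append(merge_cmd(normalized, out))
    steps.append(index_cmd(out))
    return steps


def run_pipeline(steps, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for argv in steps:
        print(shlex.join(argv))
        if not dry_run:
            subprocess.run(argv, check=True)


run_pipeline(build_pipeline())
```

Call `run_pipeline(build_pipeline(), dry_run=False)` to actually execute the steps. Chromosome names in the COSMIC files and in your own VCF need to follow the same convention (for example "12" versus "chr12"), otherwise no records will match later on.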
Thank you both for your answers, Joel Wallenius and Ram!
For anyone facing the same problem, I followed Ram's suggestions and it worked beautifully. After merging the VCFs, annotate using:
This command adds the single COSMIC ID that corresponds to your variant's specific allele to the ID field of the input, and also appends the COSMIC INFO fields to each variant.
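A minimal sketch of such an annotation step, assuming `bcftools annotate` is called with `-a` pointing at the merged COSMIC VCF and `-c ID,INFO` to carry over the ID and all INFO tags. The file names are hypothetical. The small matching function is only a toy model of why a single ID comes back: when the annotation source is a VCF, bcftools annotate uses only records whose REF and ALT match the input record.

```python
import shlex

# Hypothetical file names, for illustration only.
INPUT_VCF = "sample.vcf.gz"
COSMIC_VCF = "cosmic.norm.vcf.gz"


def annotate_cmd(input_vcf=INPUT_VCF, cosmic_vcf=COSMIC_VCF,
                 out="sample.cosmic.vcf.gz"):
    """Build a bcftools annotate call that copies ID and all INFO tags."""
    return ["bcftools", "annotate", "-a", cosmic_vcf, "-c", "ID,INFO",
            "-Oz", "-o", out, input_vcf]


def allele_specific_ids(cosmic_records, chrom, pos, ref, alt):
    """Toy model of allele matching: keep IDs whose CHROM, POS, REF, ALT all agree."""
    return [rec_id for (c, p, r, a, rec_id) in cosmic_records
            if (c, p, r, a) == (chrom, pos, ref, alt)]


# COSV55497419 is the C>A record from the question. The ALT alleles given to
# the other two IDs are placeholders and were not checked against COSMIC.
cosmic_records = [
    ("12", 25398284, "C", "A", "COSV55497419"),
    ("12", 25398284, "C", "T", "COSV55497369"),
    ("12", 25398284, "C", "G", "COSV55497479"),
]

print(shlex.join(annotate_cmd()))
print(allele_specific_ids(cosmic_records, "12", 25398284, "C", "A"))
```

The COSMIC VCF passed to `-a` has to be bgzip-compressed and indexed. If the input already has values in the ID column that you want to keep, check the `-c` documentation for the variants that add or append instead of overwrite.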