I have downloaded the COSMIC mutation file based on GRCh38. I have the COSMIC mutation IDs for each mutation (e.g., COSM521, COSM520, etc.). If I copy these IDs into the search box on the website, I get all the related information, such as the Ensembl contig. Using these Ensembl contigs, I visit the Ensembl database and extract the sequence associated with each variant. Is there any way to extract all the COSMIC variant sequences from the Ensembl database without doing this individually for each one? In other words, how can I map the COSMIC variants to the Ensembl ones?
Hello,
it is unclear to me what you mean by
Do you mean how the sequence changes due to the variant, e.g. A>G for COSM521? Isn't this information in the file you've downloaded?
fin swimmer
Sorry for the confusion. I meant the flanking sequence containing the variant position.
What do you mean by "sequence"? The flanking region? Just the base-pair change?
The flanking region containing the variant
If you have the reference sequence (in this case GRCh38), bedtools flank (https://bedtools.readthedocs.io/en/latest/content/tools/flank.html) will give you the flanking ranges for each variant, and bedtools getfasta (https://bedtools.readthedocs.io/en/latest/content/tools/getfasta.html) will then extract the sequences for those ranges from the reference.
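To make the coordinate handling concrete, here is a minimal Python sketch of the same idea: compute a window around each variant position and slice it out of the reference. Everything in it is an assumption for illustration: the toy `reference` dictionary stands in for a real GRCh38 FASTA, the variant IDs and positions are made up, and the helper names `variant_window` and `flanking_sequence` are hypothetical. Note that bedtools flank returns the regions on either side of a feature without the feature itself, whereas this sketch keeps the variant base in the window, since that is what was asked for.

```python
def variant_window(pos, flank, chrom_length):
    """Return a 0-based, half-open (start, end) window around a 1-based position.

    The window is clipped to the chromosome boundaries, so variants near
    either end yield a shorter sequence instead of an invalid range.
    """
    start = max(0, pos - 1 - flank)
    end = min(chrom_length, pos + flank)
    return start, end


def flanking_sequence(reference, chrom, pos, flank):
    """Slice the reference sequence covering `pos` plus `flank` bases per side."""
    seq = reference[chrom]
    start, end = variant_window(pos, flank, len(seq))
    return seq[start:end]


# Toy stand-in for a reference genome (assumption: real data would be
# loaded from a GRCh38 FASTA file instead).
reference = {"chr1": "ACGTACGTACGTACGTACGT"}

# Hypothetical variants: (id, chromosome, 1-based position).
variants = [
    ("var_middle", "chr1", 10),
    ("var_near_start", "chr1", 2),
    ("var_near_end", "chr1", 20),
]

flank_size = 3
results = {
    var_id: flanking_sequence(reference, chrom, pos, flank_size)
    for var_id, chrom, pos in variants
}

for var_id, seq in results.items():
    print(var_id, seq)
```

The windows are expressed as 0-based, half-open intervals, which is the BED convention, so the same `(start, end)` pairs could be written to a BED file and passed to getfasta for a real genome.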
Thanks for your reply. So my question is: do I need to incorporate variant information into the reference sequence? What if the flanking region of a SNP overlaps an indel? If I don't incorporate the indel into the reference sequence, wouldn't I lose information? Or should I just use the reference sequence as it is?