I want to analyze the flanking sequences around cancer variants. I have already downloaded cancer variants from the COSMIC database. Now I am planning to extract the flanking sequences from the reference build used by the variant database (hg38 in this case) using bioMart. But I doubt whether these flanking sequences properly represent what we would expect in a cancer genome. For example:
Take AAGCT, where G is the variant and AA and CT are the flanking sequences. What if, in the original cancer BAM file containing the variant G at that position, AA and CT are themselves mutated to AG and CA? In other words, if I want to study the flanking sequences of cancer variants, is it a good idea to extract them from the reference build? How much variation in the data am I losing by doing this?
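To make the concern concrete, here is a minimal Python sketch using only the toy example above. The helper names (`flanks`, `apply_snvs`, `flank_mismatches`) and the nearby substitutions are hypothetical and purely illustrative. Nothing here queries COSMIC, bioMart, or a real BAM file. It only shows how flanks taken from the reference could differ from flanks in a sample that carries additional nearby substitutions.

```python
def flanks(seq, pos, k):
    """Return the (left, right) flanks of length k around 0-based position pos."""
    if pos - k < 0 or pos + k >= len(seq):
        raise ValueError("flank runs off the end of the sequence")
    return seq[pos - k:pos], seq[pos + 1:pos + 1 + k]


def apply_snvs(seq, snvs):
    """Return a copy of seq with single-base substitutions {position: alt} applied."""
    bases = list(seq)
    for pos, alt in snvs.items():
        bases[pos] = alt
    return "".join(bases)


def flank_mismatches(ref_flanks, sample_flanks):
    """Count the bases that differ between two (left, right) flank pairs."""
    return sum(
        a != b
        for ref_side, sample_side in zip(ref_flanks, sample_flanks)
        for a, b in zip(ref_side, sample_side)
    )


# Toy data from the question: variant G at index 2, flanks AA and CT.
reference = "AAGCT"
variant_pos = 2

# Hypothetical nearby substitutions in the tumour sample (AA -> AG, CT -> CA).
nearby_snvs = {1: "G", 4: "A"}
sample = apply_snvs(reference, nearby_snvs)

ref_flanks = flanks(reference, variant_pos, 2)
sample_flanks = flanks(sample, variant_pos, 2)

print(ref_flanks)      # ('AA', 'CT')
print(sample_flanks)   # ('AG', 'CA')
print(flank_mismatches(ref_flanks, sample_flanks))  # 2
```

With real data, the sample-specific flanks would have to come from the sample's own variant calls or reads, which is exactly the information a reference-only extraction does not have.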
What is the question that you want to answer?