I used AnnoVar to annotate my final VCF output files with the "summarize_annovar.pl" script. The resulting CSV file is big (90 MB for 11 exome sequencing samples of cancer/tumor tissue). I need to set up criteria to shrink this large amount of data to something more manageable and useful. The criteria I can think of so far are filtering out synonymous SNVs and perhaps intronic variants(?). I am also thinking about using the PolyPhen2 prediction score to screen the data, but I am not sure if this is the right way. What other criteria do you recommend for finding the relevant variants in the cancer samples?
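To make the first two criteria concrete, here is a minimal sketch of how the CSV could be filtered row by row. The column names ("Func", "ExonicFunc") and the values ("intronic", "synonymous SNV") are assumptions about the summarize_annovar.pl output and should be checked against the header row of the actual file; the demo rows are made up.

```python
import csv
import io

# ASSUMPTION: these column names and values match the summarize_annovar.pl CSV.
# Check them against the header row of your own file before relying on this.
FUNC_COL = "Func"
EXONIC_FUNC_COL = "ExonicFunc"
DROP_FUNC = {"intronic"}
DROP_EXONIC_FUNC = {"synonymous SNV"}


def keep_row(row):
    """Return True if the variant survives both filters mentioned in the post."""
    if row.get(FUNC_COL, "").strip() in DROP_FUNC:
        return False
    if row.get(EXONIC_FUNC_COL, "").strip() in DROP_EXONIC_FUNC:
        return False
    return True


def filter_rows(handle):
    """Yield the rows of an open CSV file that pass keep_row."""
    for row in csv.DictReader(handle):
        if keep_row(row):
            yield row


# Tiny made-up example, not real data.
demo = io.StringIO(
    "Func,Gene,ExonicFunc\n"
    "exonic,GENE1,nonsynonymous SNV\n"
    "exonic,GENE2,synonymous SNV\n"
    "intronic,GENE3,\n"
)
kept = list(filter_rows(demo))
print([r["Gene"] for r in kept])
```

For the real file, the same `filter_rows` function can be given the handle from `open(path, newline="")`. Reading row by row keeps memory use low for a 90 MB file.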
Also, in the AAChange column, I don't know how to read an entry like "uc001gkl.1:c.C2789T:p.P930L". Does this mean two point mutations (C2789T and P930L)? And what do the "c." and "p." that precede the two changes mean? My hunch is that "c." is for confidence and "p." is for probable. I did not find the answer on the AnnoVar website, so I decided to post the questions here. Thanks.
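For reference, in HGVS-style notation the "c." prefix marks a position in the coding DNA sequence and "p." a position in the protein, so the entry appears to describe one substitution at two levels rather than two separate mutations. The sketch below splits the example string and checks that nucleotide 2789 falls in codon 930. The field layout (transcript, then "c." change, then "p." change) is inferred from this single example only; other entries may carry extra fields, and `parse_aachange` is a hypothetical helper name.

```python
import re

# ASSUMPTION: layout inferred from the one example in the post
# ("uc001gkl.1:c.C2789T:p.P930L"); real files may contain other shapes.
ENTRY = re.compile(
    r"(?P<transcript>[^:]+):"
    r"c\.(?P<ref_base>[ACGT])(?P<cdna_pos>\d+)(?P<alt_base>[ACGT]):"
    r"p\.(?P<ref_aa>[A-Z])(?P<aa_pos>\d+)(?P<alt_aa>[A-Z*])"
)


def parse_aachange(entry):
    """Split one AAChange entry into its parts, or return None if it does not match."""
    match = ENTRY.fullmatch(entry.strip())
    if match is None:
        return None
    parts = match.groupdict()
    parts["cdna_pos"] = int(parts["cdna_pos"])
    parts["aa_pos"] = int(parts["aa_pos"])
    return parts


parsed = parse_aachange("uc001gkl.1:c.C2789T:p.P930L")
print(parsed)

# Each codon spans three nucleotides, so coding position n lies in codon ceil(n / 3).
codon = (parsed["cdna_pos"] + 2) // 3
print(codon == parsed["aa_pos"])
```

The codon check agreeing with the "p." position is what suggests the two parts describe the same change: a C to T substitution at coding position 2789 that turns proline (P) into leucine (L) at residue 930.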
Thanks, Alex, for your detailed reply. Another question related to AnnoVar:
My VCF file contains 11 samples. When I use AnnoVar to annotate the VCF file, will the final output retain the same order of the 11 samples? I tried to verify this myself, but I have not been able to match the sample order in the AnnoVar output with the input VCF file.
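One half of that check can be sketched in code: reading the sample order from the input VCF. In the VCF format the sample names are the columns after FORMAT on the "#CHROM" header line. This does not say anything about what AnnoVar does with that order; it only gives the input-side list to compare against. The sample names in the demo are made up, and `vcf_sample_names` is a hypothetical helper name.

```python
import io


def vcf_sample_names(handle):
    """Return sample names in the order they appear on the #CHROM header line."""
    for line in handle:
        if line.startswith("#CHROM"):
            fields = line.rstrip("\n").split("\t")
            # The first 9 columns are fixed (CHROM through FORMAT); samples follow.
            return fields[9:]
        if not line.startswith("#"):
            # Reached the data lines without finding a column header.
            break
    return []


# Tiny made-up VCF, not real data.
demo = io.StringIO(
    "##fileformat=VCFv4.1\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\ttumor_01\ttumor_02\n"
    "1\t100\t.\tC\tT\t50\tPASS\t.\tGT\t0/1\t0/0\n"
)
names = vcf_sample_names(demo)
print(names)
```

With the input order in hand, picking a variant whose genotypes differ clearly between samples makes it easier to see which output column corresponds to which sample.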