Question

Help Needed With Annovar - Csv Summary

11.3 years ago

newDNASeqer ▴ 790

I used ANNOVAR's "summarize_annovar.pl" script to annotate my final VCF output files. The resulting CSV file is large (90 MB for 11 exome sequencing samples of cancer/tumor tissue). I need to set up criteria to shrink the data to something more manageable and useful. The criteria I can think of so far are filtering out synonymous SNVs and perhaps intronic variants. I am also thinking about using the PolyPhen-2 prediction score to screen the data, but I am not sure this is the right approach. What other criteria would you recommend for finding the relevant variants in the cancer samples?

Also, in the AAChange column, I don't know how to read "uc001gkl.1:c.C2789T:p.P930L". Does this mean two point mutations (C2789T and P930L)? And what do the "c." and "p." that precede them mean? My hunch is that "c." is for confidence and "p." is for probable. I did not find the answer on the ANNOVAR website, so I decided to post the question here. Thanks.

annovar vcf • 4.4k views

updated 11.3 years ago by Alex Paciorkowski 3.5k • written 11.3 years ago by newDNASeqer ▴ 790

score 3 · Answer 1 · 2013-08-16

It sounds like you are fairly new to working with this kind of data. Here's my advice:

1) Learn some basic Unix command-line skills. grep, for example, will help you wrangle a great deal of this type of data.

2) Keep your files in VCF format, which lets you do much more, such as filtering from the command line. Avoid CSV; it makes me worry that you are trying to load all your data into MS Excel, which you never want to do. So, for example:

$ grep -E '^#|nonsynonymous' your_annotated_file.vcf > your_annotated_file_nonsynon.vcf

That will give you a selection of variants that is perhaps more interesting to you. The '^#' alternative keeps the VCF header lines, so the output is still a readable VCF file. Many people also write filters in languages such as Perl or Python, or use a tool like awk from the command line. These are all things to read about and try.
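If you do end up working with the CSV summary, the same idea can be sketched in Python. This is only a minimal sketch under stated assumptions: the column names "Func" and "ExonicFunc" and the values "intronic" and "synonymous SNV" are assumed here, so check them against the header of your own file, and the function names are made up for illustration.

```python
import csv
import io


def keep_variant(func, exonic_func):
    """Return True unless the variant is intronic or a synonymous SNV."""
    func = func.strip().lower()
    exonic_func = exonic_func.strip().lower()
    if func == "intronic":
        return False
    # "nonsynonymous SNV" starts with "non", so it is not caught here.
    if exonic_func.startswith("synonymous"):
        return False
    return True


def filter_csv(src, dst, func_col="Func", exonic_col="ExonicFunc"):
    """Copy rows from src to dst, dropping intronic and synonymous variants.

    src and dst are open text handles. The column names are assumptions;
    pass different ones if your header differs. Returns the number of
    rows kept.
    """
    reader = csv.reader(src)
    writer = csv.writer(dst)
    header = next(reader)
    func_i = header.index(func_col)
    exonic_i = header.index(exonic_col)
    writer.writerow(header)
    kept = 0
    for row in reader:
        if len(row) <= max(func_i, exonic_i):
            continue  # skip malformed rows that are too short
        if keep_variant(row[func_i], row[exonic_i]):
            writer.writerow(row)
            kept += 1
    return kept


if __name__ == "__main__":
    # Tiny made-up table, only to show the call; not real data.
    demo = io.StringIO(
        "Func,Gene,ExonicFunc\n"
        "exonic,GENE1,nonsynonymous SNV\n"
        "exonic,GENE2,synonymous SNV\n"
        "intronic,GENE3,\n"
    )
    out = io.StringIO()
    print(filter_csv(demo, out), "rows kept")
```

For a real file you would pass handles opened with open(path, newline=""). Reading row by row means the 90 MB file never has to fit in memory or in a spreadsheet.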

3) Ask around in other labs working on data like yours, and read papers in your field describing similar experiments. Then you can formulate reasonable hypotheses and test them against your data. Filtering by PolyPhen may or may not be a reasonable approach; it depends on what you are trying to accomplish.

4) Learn about genomic notation. "uc001gkl.1:c.C2789T:p.P930L" describes a single variant, not two: in transcript uc001gkl.1 (a UCSC Known Genes identifier) there is a coding-sequence ("c.") nucleotide change from C to T at position 2789, which results in a protein-level ("p.") amino acid substitution from proline (P) to leucine (L) at codon 930. It has nothing to do with confidence or probability.
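To make the notation concrete, here is a small Python sketch that splits such a string into its parts. It is an illustration only: the function name and the returned dictionary layout are made up, and the regular expressions handle only simple single-base substitutions like the one in the question, not insertions, deletions or other forms.

```python
import re

# c.<ref base><position><alt base>, e.g. c.C2789T
CDNA = re.compile(r"^c\.([ACGT])(\d+)([ACGT])$")
# p.<ref amino acid><codon><alt amino acid>, e.g. p.P930L
PROTEIN = re.compile(r"^p\.([A-Z*])(\d+)([A-Z*])$")


def parse_aachange(entry):
    """Split one colon-separated AAChange entry into its components.

    Returns a dict with the leading identifier plus (ref, position, alt)
    tuples for the coding and protein changes, or None where a part is
    missing or not a simple substitution.
    """
    fields = entry.strip().split(":")
    result = {"id": fields[0], "cdna": None, "protein": None}
    for field in fields[1:]:
        match = CDNA.match(field)
        if match:
            result["cdna"] = (match.group(1), int(match.group(2)), match.group(3))
            continue
        match = PROTEIN.match(field)
        if match:
            result["protein"] = (match.group(1), int(match.group(2)), match.group(3))
    return result


if __name__ == "__main__":
    print(parse_aachange("uc001gkl.1:c.C2789T:p.P930L"))
```

Both tuples describe the same event at two levels: the nucleotide change in the coding sequence and the amino acid change it causes.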