I ran snpEff and have the resulting VCF file.
How can I find the most common codon changes, e.g. CCG (Proline) to CCA (Proline), and the number of times each one occurs (e.g. 300), among non-synonymous and synonymous SNPs?
Can you paste a line from your snpEff output? The snpEff output file that I have can be parsed easily with an awk one-liner:
grep "SYNONYMOUS" input.snpeff | awk '{split($0,a,"|"); print a[3]}' | awk '{split($0,b,"/"); print b[1],"\t",b[2]}'
produces the following result:
tAt tTt
cTt cGt
ggT ggC
Cga Aga
acG acT
acA acT
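The pattern in grep "SYNONYMOUS" matches both synonymous and non-synonymous SNPs, because the string also occurs inside the non-synonymous effect name. You can then take the output and do the counting, for example by piping it through sort | uniq -c | sort -rn. Is this what you need?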
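If you would rather do the extraction and the counting in one place, here is a minimal Python sketch of the same idea. It assumes, as the awk one-liner does, that the codon change is the third "|"-separated field of each line. The function name and the sample lines are made up for illustration, and codons are upper-cased so that ccG/ccA and CCG/CCA count as the same change.

```python
from collections import Counter


def count_codon_changes(lines):
    """Count (old_codon, new_codon) pairs, mirroring the grep/awk pipeline."""
    counts = Counter()
    for line in lines:
        # Same substring filter as grep "SYNONYMOUS": keeps both
        # synonymous and non-synonymous effects.
        if "SYNONYMOUS" not in line:
            continue
        fields = line.split("|")
        # Assumption: the third "|"-separated field looks like "ccG/ccA".
        if len(fields) < 3 or "/" not in fields[2]:
            continue
        old, new = fields[2].split("/")[:2]
        counts[(old.upper(), new.upper())] += 1
    return counts


# Hypothetical lines, not real snpEff output.
sample = [
    "1\t100\t.\tG\tA\t.\t.\tEFF=SYNONYMOUS_CODING(LOW|SILENT|ccG/ccA|P12|300|G1)",
    "1\t250\t.\tG\tA\t.\t.\tEFF=SYNONYMOUS_CODING(LOW|SILENT|ccG/ccA|P80|300|G1)",
    "2\t500\t.\tA\tT\t.\t.\tEFF=NON_SYNONYMOUS_CODING(MODERATE|MISSENSE|tAt/tTt|Y7F|90|G2)",
    "2\t900\t.\tC\tT\t.\t.\tEFF=INTRON(MODIFIER|||||G3)",
]

for (old, new), n in count_codon_changes(sample).most_common():
    print(old, new, n)
```

On a real file you would pass the open file handle instead of the sample list. Like the awk version, this only looks at the first effect on each line.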
Keep in mind that the INFO field holding the snpEff annotations can contain multiple predicted effects, depending on the organism and databases you annotate with. With human data, for instance, you get several annotations per variant because multiple transcripts can overlap a position, and each can have a different impact.
You can use awk and grep in combination, as @Ashutosh recommended. You can also use something like PyVCF to parse your VCF file programmatically, although you will still have to split the INFO field yourself to get at the snpEff effect(s). If you are dealing with model organism data, you could also use a tool like GEMINI to pick out the top-scoring impact per variant for you and store everything in an sqlite3 database, which you can then query to do your counts.
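To give an idea of what parsing the INFO field yourself might look like, here is a rough sketch in plain Python (no PyVCF). It assumes the older snpEff layout, where INFO carries an EFF= entry holding comma-separated effects of the form NAME(impact|class|codon_change|...) and the effect names are SYNONYMOUS_CODING and NON_SYNONYMOUS_CODING. If your file uses a different annotation layout, the field positions will differ. The function names and the sample line are hypothetical.

```python
import re
from collections import Counter

# One effect looks like NAME(field1|field2|...); effects are comma-separated.
EFFECT_RE = re.compile(r"(\w+)\(([^)]*)\)")
WANTED = ("SYNONYMOUS_CODING", "NON_SYNONYMOUS_CODING")


def codon_changes_for_variant(vcf_line):
    """Return the set of (effect, old_codon, new_codon) for one VCF data line."""
    info = vcf_line.rstrip("\n").split("\t")[7]  # INFO is the 8th VCF column
    eff = next((kv[4:] for kv in info.split(";") if kv.startswith("EFF=")), "")
    changes = set()
    for name, body in EFFECT_RE.findall(eff):
        if name not in WANTED:
            continue
        parts = body.split("|")
        # Assumption: the third sub-field is the codon change, e.g. "ccG/ccA".
        if len(parts) > 2 and "/" in parts[2]:
            old, new = parts[2].split("/", 1)
            changes.add((name, old.upper(), new.upper()))
    return changes


def count_codon_changes_by_effect(vcf_lines):
    """Count each distinct codon change once per variant, skipping headers."""
    counts = Counter()
    for line in vcf_lines:
        if line.startswith("#"):
            continue
        counts.update(codon_changes_for_variant(line))
    return counts


# Hypothetical variant hitting two transcripts of one gene and a second gene.
sample_vcf = [
    "##fileformat=VCFv4.1",
    "1\t100\t.\tG\tA\t50\tPASS\tDP=10;EFF="
    "SYNONYMOUS_CODING(LOW|SILENT|ccG/ccA|P12|300|GENE1|protein_coding|CODING|TX1|1|1),"
    "SYNONYMOUS_CODING(LOW|SILENT|ccG/ccA|P40|320|GENE1|protein_coding|CODING|TX2|1|1),"
    "NON_SYNONYMOUS_CODING(MODERATE|MISSENSE|cGa/cAa|R5Q|100|GENE2|protein_coding|CODING|TX3|1|1)",
]

for key, n in count_codon_changes_by_effect(sample_vcf).most_common():
    print(*key, n)
```

Collecting the changes in a set means a variant that reports the same codon change on several transcripts is counted once. Whether that is what you want depends on your question, so swap the set for a list if every transcript-level annotation should count.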
There are quite a few ways to approach this problem, depending on your level of programming comfort and the system you are working on.