Extracting annotation values from VCF files to plot their distributions
8.9 years ago
kirannbishwa01 ★ 1.6k

I want to extract the annotation values (QUAL, BaseQRankSum, ClippingRankSum, DP, FS, MQRankSum, etc.) for the variants (SNPs and indels) called in my genome resequencing data. Specifically:

  1. I want to plot the distribution of these values before proceeding to stringent filtering.
  2. I also want to plot the correlation between several annotation values for the called variants.

A part of the variants_MA605.vcf file looks like this:

#CHROM    POS    ID    REF    ALT    QUAL    FILTER    INFO    FORMAT    MA605
scaffold_1111    62    .    T    A    61.77    .    AC=1;AF=0.500;AN=2;BaseQRankSum=0.358;ClippingRankSum=-1.231;DP=5;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=37.19;MQRankSum=-1.231;QD=12.35;ReadPosRankSum=0.358;SOR=1.022    GT:AD:DP:GQ:PL    0/1:2,3:5:73:90,0,73
scaffold_1111    301    .    G    A    119.77    .    AC=1;AF=0.500;AN=2;BaseQRankSum=2.227;ClippingRankSum=-1.598;DP=73;FS=0.000;MLEAC=1;MLEAF=0.500;MQ=27.33;MQRankSum=1.356;QD=1.64;ReadPosRankSum=1.404;SOR=0.596    GT:AD:DP:GQ:PL    0/1:59,11:70:99:148,0,1738
scaffold_1111    340    .    C    T    105.77    .    AC=1;AF=0.500;AN=2;BaseQRankSum=1.547;ClippingRankSum=-0.490;DP=33;FS=9.645;MLEAC=1;MLEAF=0.500;MQ=22.79;MQRankSum=1.351;QD=3.21;ReadPosRankSum=1.116;SOR=2.799    GT:AD:DP:GQ:PL    0/1:23,10:33:99:134,0,601

I ran SnpSift (part of SnpEff) with this command:

java -jar SnpSift.jar extractFields variants_MA605.vcf CHROM POS ID AF QUAL > raw01VarMA605qual.txt

The output text file looks like this:

#CHROM    POS    ID    AF    QUAL
scaffold_1111    62        0.500    61.77
scaffold_1111    301        0.500    119.77
scaffold_1111    340        0.500    105.77

While extracting the QUAL values (and the other column values: CHROM, REF, ALT) has been clear and straightforward, I am not able to pull the annotation values for AC, BaseQRankSum, ClippingRankSum, etc., because they are stored as multiple annotations within the INFO field. I have checked the documentation, but it was not clear to me and I have not succeeded. How can I extract these INFO fields separately so that I can test for correlation between the annotation values?

I have been using SnpSift to get the QUAL values into a text file and R to plot the distribution. Are there any tools other than SnpSift that may do a better job of extracting the annotations and producing the appropriate plots?

Thanks in advance!

SNP annotation • 5.9k views
8.9 years ago
Andreas ★ 2.5k

Another tool for extracting values from a VCF file: vcf_get_val.py (requires PyVCF)
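If installing an extra dependency is not an option, the same kind of extraction can be sketched with the Python standard library alone. To be clear, this is not vcf_get_val.py and does not use the PyVCF API; the function names `parse_info` and `extract_fields` are made up for illustration, and the sketch assumes a tab-delimited, uncompressed VCF with INFO in column 8.

```python
def parse_info(info):
    """Split a VCF INFO string such as 'AC=1;AF=0.500;DP=5' into a dict."""
    out = {}
    for item in info.split(";"):
        if not item or item == ".":
            continue
        key, sep, value = item.partition("=")
        # Flag-type tags carry no '=value' part; record them as True.
        out[key] = value if sep else True
    return out


def extract_fields(lines, tags):
    """Yield [CHROM, POS, QUAL, <tag values...>] for every variant line.

    Tags that are absent from a record are reported as 'NA'.
    """
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header lines and blank lines
        cols = line.rstrip("\n").split("\t")
        info = parse_info(cols[7])
        yield [cols[0], cols[1], cols[5]] + [info.get(t, "NA") for t in tags]


# Small demo on the first record from the question (tabs restored by hand).
record = (
    "scaffold_1111\t62\t.\tT\tA\t61.77\t.\t"
    "AC=1;AF=0.500;AN=2;BaseQRankSum=0.358;ClippingRankSum=-1.231;DP=5;"
    "FS=0.000;MQ=37.19;MQRankSum=-1.231;QD=12.35\t"
    "GT:AD:DP:GQ:PL\t0/1:2,3:5:73:90,0,73\n"
)
wanted = ["BaseQRankSum", "ClippingRankSum", "DP", "FS", "MQRankSum"]
print("\t".join(["CHROM", "POS", "QUAL"] + wanted))
for row in extract_fields([record], wanted):
    print("\t".join(str(v) for v in row))
```

For a real file, passing an open file handle as `lines` (for example `open("variants_MA605.vcf")`) should work the same way, and the printed table can then be read into R for plotting.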

Andreas

8.9 years ago

Use the bcftools query subcommand like this (untested):

bcftools query -f "%CHROM\t%POS\t%ID\t%INFO/AF\t%QUAL\t%INFO/BaseQRankSum\n" "$vcf_file"

However, SnpSift also seems to be able to do this. In fact, in your example you extracted the AF field, which is an INFO tag. Or am I wrong?
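Once the INFO tags are in a tab-delimited table from either tool, the correlation part of the question can be checked quickly outside R too. The following is only a sketch under assumptions: the helper names `numeric_pairs` and `pearson` are hypothetical, the rows are assumed to be already split into columns, and missing values are assumed to appear as empty strings, `.` or `NA`.

```python
import math


def numeric_pairs(rows, i, j):
    """Collect paired floats from columns i and j, skipping rows where
    either value is missing or not numeric."""
    xs, ys = [], []
    for row in rows:
        try:
            x, y = float(row[i]), float(row[j])
        except (ValueError, IndexError):
            continue
        xs.append(x)
        ys.append(y)
    return xs, ys


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant column")
    return sxy / math.sqrt(sxx * syy)


# DP and QD for the three records shown in the question.
table = [["5", "12.35"], ["73", "1.64"], ["33", "3.21"]]
dp, qd = numeric_pairs(table, 0, 1)
print("Pearson r between DP and QD:", pearson(dp, qd))
```

With only three variants the number is not meaningful; the point is the shape of the calculation, which would be applied to the full extracted table.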


Thanks for spotting that. I was completely unaware that it had pulled the AF field values. Thanks for pointing that out to me.
