We have been developing vcfanno for some time now as a tool to quickly and easily annotate a VCF (that is, add values to its INFO field). It is very fast, and binaries are provided for all major systems in both 32-bit and 64-bit builds.
The README has a fuller introduction; here we briefly cover the basics.
Users specify a simple conf file (TOML format) that contains the annotation files, the columns/fields to pull from each file, and the operations to perform on each file.
Simple Annotation
For example, we may wish to annotate our VCF with low-complexity regions in a BED file. That would be indicated by this conf section:
[[annotation]]
file="LCR-hs37d5.bed.gz"
names=["LCR"]
columns=[2]
ops=["flag"]
This tells vcfanno where to find the file, how to name it (LCR) in the annotated file, which column to pull from (we just use a dummy column), and the ops to perform on it. In this case, we just want to know if the variant overlapped with a low-complexity region, so the flag op indicates presence of an overlap.
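To make the idea of "presence of an overlap" concrete, here is a minimal JavaScript sketch of what a flag amounts to. It is an illustration only, not vcfanno's implementation: the function names, the object layout, and the sample interval are all hypothetical. It relies on two format facts: BED intervals are 0-based and half-open, while VCF positions are 1-based.

```javascript
// Illustration only: not vcfanno's code. All names and data are hypothetical.
// BED intervals are 0-based, half-open [start, end); VCF positions are 1-based.
function overlapsBed(variant, interval) {
  if (variant.chrom !== interval.chrom) {
    return false;
  }
  // Convert the variant to 0-based, half-open coordinates.
  var vStart = variant.pos - 1;
  var vEnd = vStart + variant.ref.length;
  return vStart < interval.end && interval.start < vEnd;
}

// A flag only records whether at least one overlapping interval exists.
function flag(variant, intervals) {
  for (var i = 0; i < intervals.length; i++) {
    if (overlapsBed(variant, intervals[i])) {
      return true;
    }
  }
  return false;
}

// One made-up low-complexity region on chromosome 1.
var lcr = [{chrom: "1", start: 100, end: 200}];
console.log(flag({chrom: "1", pos: 150, ref: "A"}, lcr)); // true
console.log(flag({chrom: "1", pos: 201, ref: "A"}, lcr)); // false
```

The value in the dummy column never enters the result; only the coordinates matter for a flag.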
Ops
Other ops include mean, max, min, concat, uniq, etc. In many cases, each query variant will only overlap a single annotation, in which case the choice of op has little effect.
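The following JavaScript sketch shows why the op rarely matters when there is a single overlap. These are hypothetical re-implementations written for illustration, not vcfanno's own code; in particular, joining concat and uniq results with a comma is an assumption here.

```javascript
// Illustration only: hypothetical re-implementations, not vcfanno's code.
var ops = {
  mean: function (vals) {
    var sum = 0;
    for (var i = 0; i < vals.length; i++) {
      sum += vals[i];
    }
    return sum / vals.length;
  },
  max: function (vals) { return Math.max.apply(null, vals); },
  min: function (vals) { return Math.min.apply(null, vals); },
  // Assumption: values are joined with a comma.
  concat: function (vals) { return vals.join(","); },
  uniq: function (vals) {
    var seen = {};
    var out = [];
    for (var i = 0; i < vals.length; i++) {
      if (!Object.prototype.hasOwnProperty.call(seen, vals[i])) {
        seen[vals[i]] = true;
        out.push(vals[i]);
      }
    }
    return out.join(",");
  }
};

var single = [0.25];              // one overlapping annotation
var several = [0.25, 0.75, 0.25]; // three overlapping annotations

// With a single value, every op reports the same number.
console.log(ops.mean(single), ops.max(single), ops.min(single));
// With several values, the ops diverge.
console.log(ops.max(several), ops.min(several), ops.uniq(several));
```

With `single`, the numeric ops all return 0.25 and the string ops return "0.25"; the choice only starts to matter for inputs like `several`.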
Annotate With Clinvar
Note that this is an advanced example to show the customizability of vcfanno; nearly all annotations will have a simpler configuration than this one.
For some annotations, the built-in operations are not sufficient. For example, ClinVar provides a VCF where the fields are encoded. The CLNSIG field holds numbers indicating the significance of the variant, which we would otherwise have to look up. We can add these as flags to our new VCF using a custom JavaScript op to handle the annotation. A pathogenic variant in ClinVar is encoded with the number 5, so we write the following JavaScript:
function clinvar_pathogenic_flag(vals){
    // vals holds the CLNSIG values collected for the current query variant
    for(var i=0;i<vals.length;i++){
        if(vals[i] == 5){
            return true
        }
    }
    return false
}
which returns true if any of the ClinVar variants overlapping the current query variant are pathogenic. We then use that JavaScript in the ops field as:
[[annotation]]
file="clinvar_20150305.tidy.vcf.gz"
fields=["CLNSIG"]
names=["clinvar_pathogenic"]
ops=["js:clinvar_pathogenic_flag(vals)"]
Because the name of the JavaScript function ends in _flag, vcfanno knows that a flag will be returned. Because the op starts with js:, vcfanno knows it's JavaScript and evaluates the requested code with the values collected from the current variant.
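Before wiring a custom op into a conf file, it can help to try the function on its own, for example in Node.js. The sketch below repeats the function and calls it with made-up value lists; these inputs are illustrative assumptions, not real ClinVar records, and the exact type vcfanno passes for each value (number or string) is not something this post specifies.

```javascript
// The custom op from above, repeated so the snippet runs on its own.
function clinvar_pathogenic_flag(vals) {
  for (var i = 0; i < vals.length; i++) {
    if (vals[i] == 5) {
      return true;
    }
  }
  return false;
}

// Made-up inputs standing in for the CLNSIG values of overlapping records.
console.log(clinvar_pathogenic_flag([2, 5])); // true: one record is pathogenic
console.log(clinvar_pathogenic_flag([2, 3])); // false: no record has code 5
console.log(clinvar_pathogenic_flag(["5"]));  // true: == coerces "5" to 5
console.log(clinvar_pathogenic_flag([]));     // false: nothing overlapped
```

The loose comparison (`==`) is a deliberate fit here: it matches code 5 whether the value arrives as a number or as a string.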
Configuration
There are additional configuration examples here and here for common data sources such as ExAC, dbSNP, 1000G, fitcons, etc.
CADD
There is also support for annotating a VCF with CADD. Extensive help is available here.
Feedback appreciated:
Hi, Thank you for this great tutorial! The link for CADD is broken. Would you please update it? Thank you in advance!
I updated the post to link to the new cadd docs.