Does anyone have a good way of comparing vcf files? Essentially, we have called variants and we want to compare our set with another that we consider to be 'correct.' We want to adjust our filters so that the concordance between the two sets is maximized.
Unfortunately, this doesn't quite get us where we want. Here is a better question - how does one determine thresholds for depth and quality when filtering variants? Our idea was to compare with another set and see where concordances were maximized. The problem with vcf-compare is it examines the intersection of variants, and not the absolute set.
GATK has a tool that you can use to compute the precision and recall of a variant call set versus another variant call set (your benchmark / gold standard set). GATK GenotypeConcordance:
SiteConcordance_Summary
ALLELES_MATCH: counts of calls at the same site where the alleles match
ALLELES_DO_NOT_MATCH: counts of calls at the same location with different alleles, such as the EVAL set calling a 'G' alternate allele, and the comp set calling a 'T' alternate allele
EVAL_SUBSET_TRUTH: (multi-alleleic sites only) ALT alleles for EVAL are a subset of ALT alleles for COMP. See also below.
EVAL_SUPERSET_TRUTH: (multi-allelic sites only) ALT alleles for COMP are a subset of ALT alleles for EVAL. See also below.
EVAL_ONLY: counts of sites present only in the EVAL set, not in the COMP set
TRUTH_ONLY: counts of sites present only in the COMP set, not in the EVAL set
You can use bedtools substract to get the variant positions that are in one vcf file and not in another vcf file. This does not take genotypes into account.
The purpose of variant recalibration is to assign a well-calibrated probability to each variant call in a call set. This enables you to generate highly accurate call sets by filtering based on this single estimate for the accuracy of each call.
Also bedtools will work for simple operations.