Question

Comparing VCF Files

1

Entering edit mode

10.8 years ago

Daniel E Cook ▴ 280

Does anyone have a good way of comparing vcf files? Essentially, we have called variants and we want to compare our set with another that we consider to be 'correct.' We want to adjust our filters so that the concordance between the two sets is maximized.

vcf variant calling • 8.2k views

ADD COMMENT • link updated 3.4 years ago by Ram 45k • written 10.8 years ago by Daniel E Cook ▴ 280

0

Entering edit mode

Also bedtools will work for simple operations.

ADD REPLY • link updated 3.4 years ago by Ram 45k • written 10.8 years ago by Biomonika (Noolean) 3.2k

Ram · Answer 1 · 2014-08-01

1

Entering edit mode

10.8 years ago

Pierre Lindenbaum 166k

vcf-compare?

ADD COMMENT • link updated 5.6 years ago by Ram 45k • written 10.8 years ago by Pierre Lindenbaum 166k

1

Entering edit mode

Unfortunately, this doesn't quite get us where we want. Here is a better question - how does one determine thresholds for depth and quality when filtering variants? Our idea was to compare with another set and see where concordances were maximized. The problem with vcf-compare is it examines the intersection of variants, and not the absolute set.

ADD REPLY • link 10.8 years ago by Daniel E Cook ▴ 280

1

Entering edit mode

so you can first filter you vcf and then compare it with another set isn't it ?

ADD REPLY • link 10.8 years ago by Pierre Lindenbaum 166k

Ram · Answer 2 · 2014-08-07

GATK has a tool that you can use to compute the precision and recall of a variant call set versus another variant call set (your benchmark / gold standard set). GATK GenotypeConcordance:

SiteConcordance_Summary

ALLELES_MATCH: counts of calls at the same site where the alleles match
ALLELES_DO_NOT_MATCH: counts of calls at the same location with different alleles, such as the EVAL set calling a 'G' alternate allele, and the comp set calling a 'T' alternate allele
EVAL_SUBSET_TRUTH: (multi-alleleic sites only) ALT alleles for EVAL are a subset of ALT alleles for COMP. See also below.
EVAL_SUPERSET_TRUTH: (multi-allelic sites only) ALT alleles for COMP are a subset of ALT alleles for EVAL. See also below.
EVAL_ONLY: counts of sites present only in the EVAL set, not in the COMP set
TRUTH_ONLY: counts of sites present only in the COMP set, not in the EVAL set

http://www.broadinstitute.org/gatk/gatkdocs/org_broadinstitute_gatk_tools_walkers_variantutils_GenotypeConcordance.html

You can use bedtools substract to get the variant positions that are in one vcf file and not in another vcf file. This does not take genotypes into account.

http://bedtools.readthedocs.org/en/latest/content/tools/subtract.html

score 0 · Answer 3 · 2014-08-08

Is this what you're after?.. GATKs Variant Quality Score Recalibration

From the website:

Introduction

The purpose of variant recalibration is to assign a well-calibrated probability to each variant call in a call set. This enables you to generate highly accurate call sets by filtering based on this single estimate for the accuracy of each call.