Hi, I would like to quick compare genotypes in two VCF files for a set of individuals. Any quick method?
Hi, I would like to quick compare genotypes in two VCF files for a set of individuals. Any quick method?
If you want to directly compare genotypes, look at vcfgtcompare in vcflib. You can use it like this:
vcfgtcompare.sh other this.vcf other.vcf
The output will be this.vcf with genotype concordance annotations provided in the INFO column:
##INFO=<ID=other.has_variant,Number=0,Type=Flag,Description="True if other has a called alternate among samples under comparison.">
##INFO=<ID=other.genotypes.AA_AA,Number=1,Type=Integer,Description="Number of genotypes with AA_AA relationship with other">
##INFO=<ID=other.genotypes.AA_AR,Number=1,Type=Integer,Description="Number of genotypes with AA_AR relationship with other">
##INFO=<ID=other.genotypes.AA_RR,Number=1,Type=Integer,Description="Number of genotypes with AA_RR relationship with other">
##INFO=<ID=other.genotypes.AA_NN,Number=1,Type=Integer,Description="Number of genotypes with AA_NN relationship with other">
##INFO=<ID=other.genotypes.AR_AA,Number=1,Type=Integer,Description="Number of genotypes with AR_AA relationship with other">
... etc.
I typically use the other.genotypes.AA_AA
, other.genotypes.AA_AR
, etc. to tabulate concordance statistics. Here AA_AR
means that in the first file the genotype was alternate/alternate, and in the second it was alternate/reference. N
means "no-call" or partial no-call, which might happen if you were to remove non-intersecting alleles using vcfintersect. To do this for many loci, I drop the data into tab-separated format using vcf2tsv:
% vcfgtcompare.sh other this.vcf other.vcf | vcf2tsv >this_other_comparison.tsv
% R
> this_other <- read.delim("this_other_comparison.tsv")
> this_other[is.na(this_other)] <- 0 # converts NA's to 0 for later processing
Then I use this function to tabulate genotype concordance in R:
gterror <- function(s, n) {
with(s,
(sum(eval(as.symbol(paste(n, ".genotypes.AA_AR", sep=""))))
+ sum(eval(as.symbol(paste(n, ".genotypes.AR_AA", sep=""))))
+ sum(eval(as.symbol(paste(n, ".genotypes.AA_RR", sep=""))))
+ sum(eval(as.symbol(paste(n, ".genotypes.RR_AA", sep=""))))
+ sum(eval(as.symbol(paste(n, ".genotypes.AR_RR", sep=""))))
+ sum(eval(as.symbol(paste(n, ".genotypes.RR_AR", sep="")))))
/ ((sum(eval(as.symbol(paste(n, ".genotypes.RR_RR", sep=""))))
+ sum(eval(as.symbol(paste(n, ".genotypes.AA_AA", sep=""))))
+ sum(eval(as.symbol(paste(n, ".genotypes.AR_AR", sep="")))))
+ (sum(eval(as.symbol(paste(n, ".genotypes.AA_AR", sep=""))))
+ sum(eval(as.symbol(paste(n, ".genotypes.AR_AA", sep=""))))
+ sum(eval(as.symbol(paste(n, ".genotypes.AA_RR", sep=""))))
+ sum(eval(as.symbol(paste(n, ".genotypes.RR_AA", sep=""))))
+ sum(eval(as.symbol(paste(n, ".genotypes.AR_RR", sep=""))))
+ sum(eval(as.symbol(paste(n, ".genotypes.RR_AR", sep="")))))))
}
In this way (in R):
> gterror(this_other, "other") # yields rate of discordance between the two sets
There is a tool "vcf-compare" in vcftools.
You can also use BEDTools to compare two vcfs.
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
This file was generated by vcf-compare.
The command line was:
Can anyone explain what vcf-compare output means please?
Also when using a reference to compare sequence what use of uninitialized value... message means and the output? confused!
-m
is not work in vcf-compare option