Comparing Genotypes In Two Vcf Files
3
8
Entering edit mode
12.0 years ago
enza.colonna ▴ 80

Hi, I would like to quick compare genotypes in two VCF files for a set of individuals. Any quick method?

vcf genotyping • 19k views
ADD COMMENT
0
Entering edit mode

This file was generated by vcf-compare.

The command line was:

vcf-compare(r840) A_bowtie2.marked.ESAM.flt.vcf.gz A_BWA.marked.ESAM.flt.vcf.gz A_novoalign.marked.ESAM.flt.vcf.gz
VN 'Venn-Diagram Numbers'. Use `grep ^VN | cut -f 2-` to extract this part.
VN The columns are: 
VN        1  .. number of sites unique to this particular combination of files
VN        2- .. combination of files and space-separated number, a fraction of sites in the file
VN    3537    A.marked.ESAM.flt.vcf.gz (0.9%)    A_novoalign.marked.ESAM.flt.vcf.gz (0.5%)
VN    4506    A_BWA.marked.ESAM.flt.vcf.gz (0.7%)    A_bowtie2.marked.ESAM.flt.vcf.gz (1.1%)
VN    14668    A_bowtie2.marked.ESAM.flt.vcf.gz (3.6%)
VN    68272    A_BWA.marked.ESAM.flt.vcf.gz (10.2%)
VN    105267    A_novoalign.marked.ESAM.flt.vcf.gz (14.9%)
VN    210565    A_BWA.marked.ESAM.flt.vcf.gz (31.4%)    A_novoalign.marked.ESAM.flt.vcf.gz (29.8%)
VN    387387    A_BWA.marked.ESAM.flt.vcf.gz (57.8%)    A_bowtie2.marked.ESAM.flt.vcf.gz (94.5%)    A_novoalign.marked.ESAM.flt.vcf.gz (54.8%)

SN Summary Numbers. Use `grep ^SN | cut -f 2-` to extract this part.
SN    Number of REF matches:    386460
SN    Number of ALT matches:    376359
SN    Number of REF mismatches:    927
SN    Number of ALT mismatches:    10101
SN    Number of samples in GT comparison:    0

Can anyone explain what vcf-compare output means please?

Also when using a reference to compare sequence what use of uninitialized value... message means and the output? confused!

Use of uninitialized value in addition (+) at /home/rob/Downloads/vcftools/bin/vcf-compare line 1134, <$__ANONIO__> line 83918.
Use of uninitialized value in subtraction (-) at /home/rob/Downloads/vcftools/bin/vcf-compare line 1133, <$__ANONIO__> line 81038.
Use of uninitialized value in addition (+) at /home/rob/Downloads/vcftools/bin/vcf-compare line 1134, <$__ANONIO__> line 81038.
Use of uninitialized value in subtraction (-) at /home/rob/Downloads/vcftools/bin/vcf-compare line 1133, <$__ANONIO__> line 83921.
Use of uninitialized value in addition (+) at /home/rob/Downloads/vcftools/bin/vcf-compare line 1134, <$__ANONIO__> line 83921.
VN 'Venn-Diagram Numbers'. Use `grep ^VN | cut -f 2-` to extract this part.
VN The columns are:
VN        1  .. number of sites unique to this particular combination of files
VN        2- .. combination of files and space-separated number, a fraction of sites in the file
VN    3537    A_bowtie2.marked.ESAM.flt.vcf.gz (0.9%)    A_novoalign.marked.ESAM.flt.vcf.gz (0.5%)
VN    4506    A_BWA.marked.ESAM.flt.vcf.gz (0.7%)    A_bowtie2.marked.ESAM.flt.vcf.gz (1.1%)
VN    14668    A_bowtie2.marked.ESAM.flt.vcf.gz (3.6%)
VN    68272    A_BWA.marked.ESAM.flt.vcf.gz (10.2%)
VN    105267    A_novoalign.marked.ESAM.flt.vcf.gz (14.9%)
VN    210565    A_BWA.marked.ESAM.flt.vcf.gz (31.4%)    A_novoalign.marked.ESAM.flt.vcf.gz (29.8%)
VN    387387    A_BWA.marked.ESAM.flt.vcf.gz (57.8%)    A_bowtie2.marked.ESAM.flt.vcf.gz (94.5%)    A_novoalign.marked.ESAM.flt.vcf.gz (54.8%)
ADD REPLY
0
Entering edit mode

-m is not work in vcf-compare option

vcf-compare -m *.vcf.gz
Missing argument in sprintf at /usr/local/bin/vcf-compare line 142.
Invalid conversion in sprintf: &quot;%\232&quot; at /usr/local/bin/vcf-compare line 142.
Missing argument in sprintf at /usr/local/bin/vcf-compare line 142.
Invalid conversion in sprintf: &quot;%\362&quot; at /usr/local/bin/vcf-compare line 142.
Missing argument in sprintf at /usr/local/bin/vcf-compare line 142.
Invalid conversion in sprintf: &quot;%\026&quot; at /usr/local/bin/vcf-compare line 142.
Expected 1 column names, found [�BC�(�}{o�Ȗ�ߙOA�`0   Fq��,V��n@��m��R���-�6o˒V����|�=��EI��$
                                                                                                  �}qc��T<<�:�z��&lt;���q:�'���_�#���CO�EpBm�l�&@����08��gn���5z����G���ç˫����.   q)��c0&������#�4{��p�t2������?�E8��ͧ�`�x=��/�d�������Ŭ��/���d� M�/G����i���h2�%��藘�pq��`
<������_������!���V�u�\N�S0o��p���'����.O���Qpd�#k������wvw[����ܚM�����l:_#+�h�:���h����S����5�����0�&gt;��(���|W`�������}N�o%�?���a��B�����3��w���5}���I`E�/�q[���s`~�f8��y`��q���^-�N��Bo��T��
                                            ��Xh    ]�Cl=�H���z�p<�<�2��%nK}����{�^{��DJ;S�MVg<Ɓu;��I۲���r�8�0jY���5hY��MA��a�r��s����:��-�a
                                                                                                                                                          �>�#�jw:].
 at /usr/local/bin/vcf-compare line 32.
    main::error('Expected 1 column names, found [\x{1f}\x{8b}\x{8}\x{4}\x{0}\x{0}\x{0}\x{0}\x{0}\x{ff}\x{6}\x{0}BC\x{2}\x{0}\x{f5}(\x{ed}}{o\x{db}\x{c8}\x{96}\x{e7}\x{df}\x{99}O...') called at /usr/local/bin/vcf-compare line 142
    main::read_mappings_list('clinvar.vcf.gz', 'ARRAY(0xac0ee0)') called at /usr/local/bin/vcf-compare line 163
    main::compare_vcfs('HASH(0xac0b50)') called at /usr/local/bin/vcf-compare line 22
ADD REPLY
9
Entering edit mode
12.0 years ago
Erik Garrison ★ 2.4k

If you want to directly compare genotypes, look at vcfgtcompare in vcflib. You can use it like this:

vcfgtcompare.sh other this.vcf other.vcf

The output will be this.vcf with genotype concordance annotations provided in the INFO column:

##INFO=<ID=other.has_variant,Number=0,Type=Flag,Description="True if other has a called alternate among samples under comparison.">
##INFO=<ID=other.genotypes.AA_AA,Number=1,Type=Integer,Description="Number of genotypes with AA_AA relationship with other">
##INFO=<ID=other.genotypes.AA_AR,Number=1,Type=Integer,Description="Number of genotypes with AA_AR relationship with other">
##INFO=<ID=other.genotypes.AA_RR,Number=1,Type=Integer,Description="Number of genotypes with AA_RR relationship with other">
##INFO=<ID=other.genotypes.AA_NN,Number=1,Type=Integer,Description="Number of genotypes with AA_NN relationship with other">
##INFO=<ID=other.genotypes.AR_AA,Number=1,Type=Integer,Description="Number of genotypes with AR_AA relationship with other">
... etc.

I typically use the other.genotypes.AA_AA, other.genotypes.AA_AR, etc. to tabulate concordance statistics. Here AA_AR means that in the first file the genotype was alternate/alternate, and in the second it was alternate/reference. N means "no-call" or partial no-call, which might happen if you were to remove non-intersecting alleles using vcfintersect. To do this for many loci, I drop the data into tab-separated format using vcf2tsv:

% vcfgtcompare.sh other this.vcf other.vcf | vcf2tsv >this_other_comparison.tsv
% R
> this_other <- read.delim("this_other_comparison.tsv")
> this_other[is.na(this_other)] <- 0  # converts NA's to 0 for later processing

Then I use this function to tabulate genotype concordance in R:

gterror <- function(s, n) {
   with(s,
   (sum(eval(as.symbol(paste(n, ".genotypes.AA_AR", sep=""))))
    + sum(eval(as.symbol(paste(n, ".genotypes.AR_AA", sep=""))))
    + sum(eval(as.symbol(paste(n, ".genotypes.AA_RR", sep=""))))
    + sum(eval(as.symbol(paste(n, ".genotypes.RR_AA", sep=""))))
    + sum(eval(as.symbol(paste(n, ".genotypes.AR_RR", sep=""))))
    + sum(eval(as.symbol(paste(n, ".genotypes.RR_AR", sep="")))))
   / ((sum(eval(as.symbol(paste(n, ".genotypes.RR_RR", sep=""))))
       + sum(eval(as.symbol(paste(n, ".genotypes.AA_AA", sep=""))))
       + sum(eval(as.symbol(paste(n, ".genotypes.AR_AR", sep="")))))
      + (sum(eval(as.symbol(paste(n, ".genotypes.AA_AR", sep=""))))
         + sum(eval(as.symbol(paste(n, ".genotypes.AR_AA", sep=""))))
         + sum(eval(as.symbol(paste(n, ".genotypes.AA_RR", sep=""))))
         + sum(eval(as.symbol(paste(n, ".genotypes.RR_AA", sep=""))))
         + sum(eval(as.symbol(paste(n, ".genotypes.AR_RR", sep=""))))
         + sum(eval(as.symbol(paste(n, ".genotypes.RR_AR", sep="")))))))
}

In this way (in R):

> gterror(this_other, "other")  # yields rate of discordance between the two sets
ADD COMMENT
0
Entering edit mode

Thanks Erik, does it overwrite this.vcf?

ADD REPLY
0
Entering edit mode

No, it does not. With a few exceptions, all of the tools in vcflib read VCF (often on stdin) and write VCF on stdout. So here, the VCF stream which it prints on stdout is the result. The inputs are unchanged.

ADD REPLY
0
Entering edit mode

thanks Erik for clarifying this.

ADD REPLY
4
Entering edit mode
12.0 years ago

There is a tool "vcf-compare" in vcftools.

ADD COMMENT
0
Entering edit mode

Thanks Pierre, this is useful.

ADD REPLY
0
Entering edit mode
12.0 years ago

You can also use BEDTools to compare two vcfs.

ADD COMMENT

Login before adding your answer.

Traffic: 2062 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6