I have a treatment vcf file and three control vcf files. This has been generated from somatic variants on RNAseq data. I want to extract variants that are in treatment sample but not in control groups. To do so, I first normalized them using bcftools norm -m -any
command, then merge all control samples to a one single vcf file using bcftools merge
or eitheir 'bcftools merge
. Then I tried to compare treatment sample with merged control group using vcf-compare
like below
A: treatment sample , B,C,D: control samples
vcf-compare A.norm.vcf.gz control.norm.vcf.gz
# This file was generated by vcf-compare.
# The command line was: vcf-compare(v0.1.14-12-gcdb80b8) A.norm.vcf.gz control.norm.vcf.gz
#
#VN 'Venn-Diagram Numbers'. Use `grep ^VN | cut -f 2-` to extract this part.
#VN The columns are:
#VN 1 .. number of sites unique to this particular combination of files
#VN 2- .. combination of files and space-separated number, a fraction of sites in the file
#VN 330531 A.norm.vcf.gz (41.2%)
#VN 472004 A.norm.vcf.gz (58.8%) ccontrol.norm.vcf.gz (41.5%)
#VN 665094 ccontrol.norm.vcf.gz (58.5%)
#SN Summary Numbers. Use `grep ^SN | cut -f 2-` to extract this part.
#SN Number of REF matches: 462434
#SN Number of ALT matches: 454586
#SN Number of REF mismatches: **9570**
#SN Number of ALT mismatches: **7848**
#SN Number of samples in GT comparison: 0
# Number of sites lost due to grouping (e.g. duplicate sites): lost, %lost, read, reported, file
#SN Number of lost sites: 3073 0.4% 805608 802535 A.norm.vcf.gz
#SN Number of lost sites: 4897 0.4% 1141995 1137098 control.norm.vcf.gz
#> the SN block refers to sites (lines in the VCF file)
#> the GS block refers to actual genotypes (sample columns).
#> VN is the number of sites shared by both files
And this is the output of bcftools stat
:
bcftools stats -c both A.norm.vcf.gz control.norm.vcf.gz
# Definition of sets:
# ID [2]id [3]tab-separated file names
ID 0 A.norm.vcf.gz
ID 1 B.norm.vcf.gz.gz
ID 2 A.norm.vcf.gz B.norm.vcf.gz.gz
# SN, Summary numbers:
# number of records .. number of data rows in the VCF
# number of no-ALTs .. reference-only sites, ALT is either "." or identical to REF
# number of SNPs .. number of rows with a SNP
# number of MNPs .. number of rows with a MNP, such as CC>TT
# number of indels .. number of rows with an indel
# number of others .. number of rows with other type, for example a symbolic allele or
# a complex substitution, such as ACT>TCGA
# number of multiallelic sites .. number of rows with multiple alternate alleles
# number of multiallelic SNP sites .. number of rows with multiple alternate alleles, all SNPs
#
# Note that rows containing multiple types will be counted multiple times, in each
# counter. For example, a row with a SNP and an indel increments both the SNP and
# the indel counter.
#
# SN [2]id [3]key [4]value
SN 0 number of samples: 1
SN 1 number of samples: 1
SN 0 number of records: 530759
SN 0 number of no-ALTs: 0
SN 0 number of SNPs: 451656
SN 0 number of MNPs: 0
SN 0 number of indels: 78691
SN 0 number of others: 0
SN 0 number of multiallelic sites: 0
SN 0 number of multiallelic SNP sites: 0
SN 1 number of records: 203908
SN 1 number of no-ALTs: 0
SN 1 number of SNPs: 165867
SN 1 number of MNPs: 0
SN 1 number of indels: 37932
SN 1 number of others: 0
SN 1 number of multiallelic sites: 0
SN 1 number of multiallelic SNP sites: 0
SN 2 number of records: 274849
SN 2 number of no-ALTs: 0
SN 2 number of SNPs: 245587
SN 2 number of MNPs: 0
SN 2 number of indels: 29182
SN 2 number of others: 0
SN 2 number of multiallelic sites: 0
SN 2 number of multiallelic SNP sites: 0
# TSTV, transitions/transversions:
# TSTV [2]id [3]ts [4]tv [5]ts/tv [6]ts (1st ALT) [7]tv (1st ALT) [8]ts/tv (1st ALT)
TSTV 0 366683 84973 4.32 366683 84973 4.32
TSTV 1 136855 29012 4.72 136855 29012 4.72
TSTV 2 179044 66543 2.69 179044 66543 2.69
# SiS, Singleton stats:
# SiS [2]id [3]allele count [4]number of SNPs [5]number of transitions [6]number of transversions [7]number of indels [8]repeat-consistent [9]repeat-inconsistent [10]not applicable
SiS 0 1 284740 242699 42041 56195 0 0 56195
SiS 1 1 75884 66546 9338 24087 0 0 24087
SiS 2 1 124678 94975 29703 18812 0 0 18812
# AF, Stats by non-reference allele frequency:
# AF [2]id [3]allele frequency [4]number of SNPs [5]number of transitions [6]number of transversions [7]number of indels [8]repeat-consistent [9]repeat-inconsistent [10]not applicable
AF 0 0.000000 284740 242699 42041 56195 0 0 56195
AF 0 0.990000 166916 123984 42932 22496 0 0 22496
AF 1 0.000000 75884 66546 9338 24087 0 0 24087
AF 1 0.990000 89983 70309 19674 13845 0 0 13845
AF 2 0.000000 124678 94975 29703 18812 0 0 18812
AF 2 0.990000 120909 84069 36840 10370 0 0 10370
As you can see, there is only 9570 unique variants to the treatment sample. However when I try to extract these variants using bcftools isec -c both
, the return output in 0000.txt has more 180000 unique variants. How is that possible? Where am I going wrong? Any help would be appreciated.