Extract different variants of a vcf file comparing to other vcf files
0
0
Entering edit mode
21 months ago
minoo ▴ 10

I have a treatment vcf file and three control vcf files. This has been generated from somatic variants on RNAseq data. I want to extract variants that are in treatment sample but not in control groups. To do so, I first normalized them using bcftools norm -m -any command, then merge all control samples to a one single vcf file using bcftools merge or eitheir 'bcftools merge . Then I tried to compare treatment sample with merged control group using vcf-compare like below

A: treatment sample , B,C,D: control samples

vcf-compare A.norm.vcf.gz control.norm.vcf.gz

# This file was generated by vcf-compare.
# The command line was: vcf-compare(v0.1.14-12-gcdb80b8) A.norm.vcf.gz control.norm.vcf.gz
#
#VN 'Venn-Diagram Numbers'. Use `grep ^VN | cut -f 2-` to extract this part.
#VN The columns are:
#VN        1  .. number of sites unique to this particular combination of files
#VN        2- .. combination of files and space-separated number, a fraction of sites in the file
#VN    330531   A.norm.vcf.gz (41.2%)
#VN    472004    A.norm.vcf.gz (58.8%)    ccontrol.norm.vcf.gz (41.5%)
#VN    665094    ccontrol.norm.vcf.gz (58.5%)
#SN Summary Numbers. Use `grep ^SN | cut -f 2-` to extract this part.
#SN    Number of REF matches:    462434
#SN    Number of ALT matches:    454586
#SN    Number of REF mismatches:    **9570**
#SN    Number of ALT mismatches:    **7848**
#SN    Number of samples in GT comparison:    0
# Number of sites lost due to grouping (e.g. duplicate sites): lost, %lost, read, reported, file
#SN    Number of lost sites:    3073    0.4%    805608    802535    A.norm.vcf.gz
#SN    Number of lost sites:    4897    0.4%    1141995    1137098    control.norm.vcf.gz
#> the SN block refers to sites (lines in the VCF file)
#> the GS block refers to actual genotypes (sample columns).
#> VN is the number of sites shared by both files

And this is the output of bcftools stat:

bcftools stats -c both A.norm.vcf.gz control.norm.vcf.gz
# Definition of sets:
# ID    [2]id    [3]tab-separated file names
ID    0    A.norm.vcf.gz
ID    1    B.norm.vcf.gz.gz
ID    2    A.norm.vcf.gz    B.norm.vcf.gz.gz
# SN, Summary numbers:
#   number of records   .. number of data rows in the VCF
#   number of no-ALTs   .. reference-only sites, ALT is either "." or identical to REF
#   number of SNPs      .. number of rows with a SNP
#   number of MNPs      .. number of rows with a MNP, such as CC>TT
#   number of indels    .. number of rows with an indel
#   number of others    .. number of rows with other type, for example a symbolic allele or
#                          a complex substitution, such as ACT>TCGA
#   number of multiallelic sites     .. number of rows with multiple alternate alleles
#   number of multiallelic SNP sites .. number of rows with multiple alternate alleles, all SNPs
#
#   Note that rows containing multiple types will be counted multiple times, in each
#   counter. For example, a row with a SNP and an indel increments both the SNP and
#   the indel counter.
#
# SN    [2]id    [3]key    [4]value
SN    0    number of samples:    1
SN    1    number of samples:    1
SN    0    number of records:    530759
SN    0    number of no-ALTs:    0
SN    0    number of SNPs:    451656
SN    0    number of MNPs:    0
SN    0    number of indels:    78691
SN    0    number of others:    0
SN    0    number of multiallelic sites:    0
SN    0    number of multiallelic SNP sites:    0
SN    1    number of records:    203908
SN    1    number of no-ALTs:    0
SN    1    number of SNPs:    165867
SN    1    number of MNPs:    0
SN    1    number of indels:    37932
SN    1    number of others:    0
SN    1    number of multiallelic sites:    0
SN    1    number of multiallelic SNP sites:    0
SN    2    number of records:    274849
SN    2    number of no-ALTs:    0
SN    2    number of SNPs:    245587
SN    2    number of MNPs:    0
SN    2    number of indels:    29182
SN    2    number of others:    0
SN    2    number of multiallelic sites:    0
SN    2    number of multiallelic SNP sites:    0
# TSTV, transitions/transversions:
# TSTV    [2]id    [3]ts    [4]tv    [5]ts/tv    [6]ts (1st ALT)    [7]tv (1st ALT)    [8]ts/tv (1st ALT)
TSTV    0    366683    84973    4.32    366683    84973    4.32
TSTV    1    136855    29012    4.72    136855    29012    4.72
TSTV    2    179044    66543    2.69    179044    66543    2.69
# SiS, Singleton stats:
# SiS    [2]id    [3]allele count    [4]number of SNPs    [5]number of transitions    [6]number of transversions    [7]number of indels    [8]repeat-consistent    [9]repeat-inconsistent    [10]not applicable
SiS    0    1    284740    242699    42041    56195    0    0    56195
SiS    1    1    75884    66546    9338    24087    0    0    24087
SiS    2    1    124678    94975    29703    18812    0    0    18812
# AF, Stats by non-reference allele frequency:
# AF    [2]id    [3]allele frequency    [4]number of SNPs    [5]number of transitions    [6]number of transversions    [7]number of indels    [8]repeat-consistent    [9]repeat-inconsistent    [10]not applicable
AF    0    0.000000    284740    242699    42041    56195    0    0    56195
AF    0    0.990000    166916    123984    42932    22496    0    0    22496
AF    1    0.000000    75884    66546    9338    24087    0    0    24087
AF    1    0.990000    89983    70309    19674    13845    0    0    13845
AF    2    0.000000    124678    94975    29703    18812    0    0    18812
AF    2    0.990000    120909    84069    36840    10370    0    0    10370

As you can see, there is only 9570 unique variants to the treatment sample. However when I try to extract these variants using bcftools isec -c both , the return output in 0000.txt has more 180000 unique variants. How is that possible? Where am I going wrong? Any help would be appreciated.

bedtools mutect2 vcftools gatk • 1.1k views
ADD COMMENT

Login before adding your answer.

Traffic: 2354 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6