Question

Extract different variants of a vcf file comparing to other vcf files

0

Entering edit mode

21 months ago

minoo ▴ 10

I have a treatment vcf file and three control vcf files. This has been generated from somatic variants on RNAseq data. I want to extract variants that are in treatment sample but not in control groups. To do so, I first normalized them using bcftools norm -m -any command, then merge all control samples to a one single vcf file using bcftools merge or eitheir 'bcftools merge . Then I tried to compare treatment sample with merged control group using vcf-compare like below

A: treatment sample , B,C,D: control samples

vcf-compare A.norm.vcf.gz control.norm.vcf.gz

# This file was generated by vcf-compare.
# The command line was: vcf-compare(v0.1.14-12-gcdb80b8) A.norm.vcf.gz control.norm.vcf.gz
#
#VN 'Venn-Diagram Numbers'. Use `grep ^VN | cut -f 2-` to extract this part.
#VN The columns are:
#VN        1  .. number of sites unique to this particular combination of files
#VN        2- .. combination of files and space-separated number, a fraction of sites in the file
#VN    330531   A.norm.vcf.gz (41.2%)
#VN    472004    A.norm.vcf.gz (58.8%)    ccontrol.norm.vcf.gz (41.5%)
#VN    665094    ccontrol.norm.vcf.gz (58.5%)
#SN Summary Numbers. Use `grep ^SN | cut -f 2-` to extract this part.
#SN    Number of REF matches:    462434
#SN    Number of ALT matches:    454586
#SN    Number of REF mismatches:    **9570**
#SN    Number of ALT mismatches:    **7848**
#SN    Number of samples in GT comparison:    0
# Number of sites lost due to grouping (e.g. duplicate sites): lost, %lost, read, reported, file
#SN    Number of lost sites:    3073    0.4%    805608    802535    A.norm.vcf.gz
#SN    Number of lost sites:    4897    0.4%    1141995    1137098    control.norm.vcf.gz
#> the SN block refers to sites (lines in the VCF file)
#> the GS block refers to actual genotypes (sample columns).
#> VN is the number of sites shared by both files

And this is the output of bcftools stat:

bcftools stats -c both A.norm.vcf.gz control.norm.vcf.gz
# Definition of sets:
# ID    [2]id    [3]tab-separated file names
ID    0    A.norm.vcf.gz
ID    1    B.norm.vcf.gz.gz
ID    2    A.norm.vcf.gz    B.norm.vcf.gz.gz
# SN, Summary numbers:
#   number of records   .. number of data rows in the VCF
#   number of no-ALTs   .. reference-only sites, ALT is either "." or identical to REF
#   number of SNPs      .. number of rows with a SNP
#   number of MNPs      .. number of rows with a MNP, such as CC>TT
#   number of indels    .. number of rows with an indel
#   number of others    .. number of rows with other type, for example a symbolic allele or
#                          a complex substitution, such as ACT>TCGA
#   number of multiallelic sites     .. number of rows with multiple alternate alleles
#   number of multiallelic SNP sites .. number of rows with multiple alternate alleles, all SNPs
#
#   Note that rows containing multiple types will be counted multiple times, in each
#   counter. For example, a row with a SNP and an indel increments both the SNP and
#   the indel counter.
#
# SN    [2]id    [3]key    [4]value
SN    0    number of samples:    1
SN    1    number of samples:    1
SN    0    number of records:    530759
SN    0    number of no-ALTs:    0
SN    0    number of SNPs:    451656
SN    0    number of MNPs:    0
SN    0    number of indels:    78691
SN    0    number of others:    0
SN    0    number of multiallelic sites:    0
SN    0    number of multiallelic SNP sites:    0
SN    1    number of records:    203908
SN    1    number of no-ALTs:    0
SN    1    number of SNPs:    165867
SN    1    number of MNPs:    0
SN    1    number of indels:    37932
SN    1    number of others:    0
SN    1    number of multiallelic sites:    0
SN    1    number of multiallelic SNP sites:    0
SN    2    number of records:    274849
SN    2    number of no-ALTs:    0
SN    2    number of SNPs:    245587
SN    2    number of MNPs:    0
SN    2    number of indels:    29182
SN    2    number of others:    0
SN    2    number of multiallelic sites:    0
SN    2    number of multiallelic SNP sites:    0
# TSTV, transitions/transversions:
# TSTV    [2]id    [3]ts    [4]tv    [5]ts/tv    [6]ts (1st ALT)    [7]tv (1st ALT)    [8]ts/tv (1st ALT)
TSTV    0    366683    84973    4.32    366683    84973    4.32
TSTV    1    136855    29012    4.72    136855    29012    4.72
TSTV    2    179044    66543    2.69    179044    66543    2.69
# SiS, Singleton stats:
# SiS    [2]id    [3]allele count    [4]number of SNPs    [5]number of transitions    [6]number of transversions    [7]number of indels    [8]repeat-consistent    [9]repeat-inconsistent    [10]not applicable
SiS    0    1    284740    242699    42041    56195    0    0    56195
SiS    1    1    75884    66546    9338    24087    0    0    24087
SiS    2    1    124678    94975    29703    18812    0    0    18812
# AF, Stats by non-reference allele frequency:
# AF    [2]id    [3]allele frequency    [4]number of SNPs    [5]number of transitions    [6]number of transversions    [7]number of indels    [8]repeat-consistent    [9]repeat-inconsistent    [10]not applicable
AF    0    0.000000    284740    242699    42041    56195    0    0    56195
AF    0    0.990000    166916    123984    42932    22496    0    0    22496
AF    1    0.000000    75884    66546    9338    24087    0    0    24087
AF    1    0.990000    89983    70309    19674    13845    0    0    13845
AF    2    0.000000    124678    94975    29703    18812    0    0    18812
AF    2    0.990000    120909    84069    36840    10370    0    0    10370

As you can see, there is only 9570 unique variants to the treatment sample. However when I try to extract these variants using bcftools isec -c both , the return output in 0000.txt has more 180000 unique variants. How is that possible? Where am I going wrong? Any help would be appreciated.

bedtools mutect2 vcftools gatk • 1.1k views

ADD COMMENT • link updated 17 months ago by Ram 44k • written 21 months ago by minoo ▴ 10