bcftools stats numbers do not match
5.4 years ago
VBer ▴ 200

I wanted to compare two vcf files. I used the following command:

bcftools stats A.vcf.gz B.vcf.gz > A_vs_B.vchk

The following is the output from the .vchk file:

# Definition of sets:
# ID    [2]id   [3]tab-separated file names
ID  0   A.vcf.gz
ID  1   B.vcf.gz
ID  2   A.vcf.gz    B.vcf.gz
.
.
.
# SN    [2]id   [3]key  [4]value
SN  0   number of samples:  2
SN  1   number of samples:  1
SN  0   number of records:  11502370
SN  0   number of no-ALTs:  0
SN  0   number of SNPs: 8593991
SN  0   number of MNPs: 0
SN  0   number of indels:   2908379
SN  0   number of others:   0
SN  0   number of multiallelic sites:   94249
SN  0   number of multiallelic SNP sites:   29157

SN  1   number of records:  3201430
SN  1   number of no-ALTs:  0
SN  1   number of SNPs: 1109969
SN  1   number of MNPs: 1292599
SN  1   number of indels:   757580
SN  1   number of others:   331637
SN  1   number of multiallelic sites:   393078
SN  1   number of multiallelic SNP sites:   8811

I have left the numbers for the intersection of the two files (set ID 2) out of the excerpt above.

Before doing this, I also did a bcftools stats on both A.vcf.gz and B.vcf.gz separately.

The numbers from the individual runs and from the comparison differ. For example, run on its own, A.vcf.gz has 15234759 SNPs and 2908379 indels. In the .vchk output from the comparison, A.vcf.gz has only 8593991 SNPs, while the number of indels is unchanged. I see the same pattern for B.vcf.gz.

Why are the numbers different only for SNPs?

In case you want to look at the individual bcftools stats output...

Output for A.vcf.gz

# SN    [2]id   [3]key  [4]value
SN  0   number of samples:  2
SN  0   number of records:  18143138
SN  0   number of no-ALTs:  0
SN  0   number of SNPs: 15234759
SN  0   number of MNPs: 0
SN  0   number of indels:   2908379
SN  0   number of others:   0
SN  0   number of multiallelic sites:   111076
SN  0   number of multiallelic SNP sites:   45984

Output for B.vcf.gz

# SN    [2]id   [3]key  [4]value
SN  0   number of samples:  1
SN  0   number of records:  9842198
SN  0   number of no-ALTs:  0
SN  0   number of SNPs: 7750737
SN  0   number of MNPs: 1292599
SN  0   number of indels:   757580
SN  0   number of others:   331637
SN  0   number of multiallelic sites:   409905
SN  0   number of multiallelic SNP sites:   25638

Actually, it looks like the bcftools stats output for a two-file comparison reports, for each individual file, only the SNPs and indels that are unique to that file.

E.g. number of records in the individual output for file A (18143138) minus number of records given for file A in the comparison (11502370) = number given as common to both A and B (6640768, not shown above). The SNP counts give the same difference: 15234759 - 8593991 = 6640768.
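The subtraction can be checked with plain shell arithmetic. This is only a sketch: every count is copied from the SN lines quoted in this thread, and the variable names are made up for illustration.

```shell
# Counts copied from the SN lines quoted in this thread.
a_alone_records=18143138   # A.vcf.gz, bcftools stats run on its own
a_cmp_records=11502370     # A.vcf.gz (set ID 0) in the two-file comparison
a_alone_snps=15234759
a_cmp_snps=8593991
b_alone_snps=7750737       # B.vcf.gz, bcftools stats run on its own
b_cmp_snps=1109969         # B.vcf.gz (set ID 1) in the two-file comparison

# Standalone count minus comparison count = what the comparison moved elsewhere.
shared_records=$(( a_alone_records - a_cmp_records ))
shared_snps_via_a=$(( a_alone_snps - a_cmp_snps ))
shared_snps_via_b=$(( b_alone_snps - b_cmp_snps ))

echo "shared records (via A): $shared_records"
echo "shared SNPs (via A):    $shared_snps_via_a"
echo "shared SNPs (via B):    $shared_snps_via_b"
```

All three differences come out as 6640768, so the records missing from each per-file set are accounted for by the same shared count.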

The comparison also reported 0 indels as common, which is consistent with the indel counts being identical in the individual and comparison outputs. Since no indels are shared between the two files, the number of indels reported as unique to each file equals the number originally in that file.
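To rebuild the per-file totals from a comparison .vchk, the set 0 and set 2 values can be added back together. The following is a rough sketch, assuming (as suggested in this reply) that set 0 holds what is unique to A and set 2 what is shared. The sample file is a hypothetical four-line excerpt, `sn_value` is a made-up helper, and the set 2 values are the ones quoted in this reply rather than copied from the real output.

```shell
# Hypothetical excerpt of A_vs_B.vchk; printf keeps the tab separators explicit.
vchk=$(mktemp)
printf 'SN\t0\tnumber of SNPs:\t8593991\n'   >  "$vchk"
printf 'SN\t0\tnumber of indels:\t2908379\n' >> "$vchk"
printf 'SN\t2\tnumber of SNPs:\t6640768\n'   >> "$vchk"
printf 'SN\t2\tnumber of indels:\t0\n'       >> "$vchk"

# sn_value ID KEY FILE: print column 4 of the SN line matching set ID and key.
sn_value() {
  awk -F'\t' -v id="$1" -v key="$2" \
    '$1 == "SN" && $2 == id && $3 == key { print $4 }' "$3"
}

# Unique-to-A (set 0) plus shared (set 2) should give the standalone totals for A.
a_snps=$(( $(sn_value 0 "number of SNPs:" "$vchk") + $(sn_value 2 "number of SNPs:" "$vchk") ))
a_indels=$(( $(sn_value 0 "number of indels:" "$vchk") + $(sn_value 2 "number of indels:" "$vchk") ))

echo "A SNPs:   $a_snps"
echo "A indels: $a_indels"
rm -f "$vchk"
```

With these inputs the sums are 15234759 SNPs and 2908379 indels, matching the standalone output for A.vcf.gz.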
