BCFtools isec output vs BCFtools query
1
1
Entering edit mode
14 months ago
gtho123 ▴ 260

I have 5 samples and I have performed variant calling using the Strelka2 software. The output of this software is a VCFfile for each sample and a VCF file containing all the variants across all samples.

I am interested in looking at which variants are shared among and/or unique to the samples. I therefore filtered the sample variant files to only contain what the software terms "high-confidence" and I then used bcftools isec. The command resembled this:

bcftools isec -p ./overlaps/ ./CL10_PASS.vcf.gz ./CL11_PASS.vcf.gz ./CL12_PASS.vcf.gz ./CL8_PASS.vcf.gz ./CL9_PASS.vcf.gz

The output contained a file called sites.txt, the beginning of which I have posted below. I am not totally sure but I interpret the fifth column to represent presence/absence of a variant for each of the samples I entered in the command, in the same order I entered them.

MyComputer$ head sites.txt 
Chr1    1074    C   A,T 00010
Chr1    1075    TC  T   11001
Chr1    6324    T   TA  11111
Chr1    12997   CT  C,CTT   00010
Chr1    88879   C   G   11101
Chr1    88937   G   A   11101
Chr1    104023  TA  T   11110
Chr1    183752  CA  CAA,C   11000
Chr1    195520  A   AC  11111
Chr1    214644  A   AT  11111

I then used bcftools query to summarise the VCF file containing all the variants across all samples, after again filtering for "high-confidence" variants:

bcftools view -f PASS ./variants.vcf.gz | bcftools query -f '%CHROM\t%POS\t%REF\t%ALT\t%TYPE[\t%GT]\n' 

This creates output which looks like this:

MyComputer$ head variantPassSummary.tsv 
Chr1    1074    C   A,T SNP 0/2 0/1 0/0 0/0 0/0
Chr1    1075    TC  T   INDEL   0/1 0/1 0/1 0/1 0/1
Chr1    6324    T   TA  INDEL   1/1 1/1 1/1 1/1 1/1
Chr1    12997   CT  C,CTT   INDEL   0/1 0/1 0/0 0/0 0/0
Chr1    88879   C   G   SNP 1/1 0/0 1/1 1/1 1/1
Chr1    88937   G   A   SNP 1/1 0/0 1/1 1/1 1/1
Chr1    104023  TA  T   INDEL   0/1 0/1 0/1 0/0 0/1
Chr1    183752  CA  CAA,C   INDEL   0/1 0/1 0/2 0/0 0/0
Chr1    195520  A   AC  INDEL   1/1 1/1 1/1 1/1 1/1
Chr1    214644  A   AT  INDEL   1/1 1/1 1/1 1/1 1/1

As you can see the outputs do not appear to say the same thing. For instance in the first row, the first two samples are heterozygous in the bcftools query output and in the bcftools isec sites.txt file only the fourth sample has this variant. There seems to be discrepancies in every row except thosw that are homozygous for the ALT allele in all samples.

The samples in the VCF file containing all the samples has the samples in the same order as I entered them into bcftools isec.

Why is this occurring and what am I misunderstanding?

bcftools VCF strelka2 htslib • 736 views
ADD COMMENT
0
Entering edit mode
14 months ago
raphael.B ▴ 520

My guess is, when you filter your genotypes in both cases you are not going to remove the same records. In the grouped VCF, quality metrics are computed across all samples. So a poor quality genotype is more likely to remain if the overall quality of the record is sufficient. In a single sample vcf, this record won't pass the filter though.

ADD COMMENT

Login before adding your answer.

Traffic: 1724 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6