I have 5 samples and I have performed variant calling using the Strelka2 software. The output of this software is a VCF file for each sample and a VCF file containing all the variants across all samples.
I am interested in which variants are shared among and/or unique to the samples. I therefore filtered the per-sample variant files to contain only what the software terms "high-confidence" variants, and then used bcftools isec. The command resembled this:
bcftools isec -p ./overlaps/ ./CL10_PASS.vcf.gz ./CL11_PASS.vcf.gz ./CL12_PASS.vcf.gz ./CL8_PASS.vcf.gz ./CL9_PASS.vcf.gz
The output contained a file called sites.txt, the beginning of which I have posted below. I am not totally sure, but I interpret the fifth column as the presence/absence of a variant in each of the samples I entered in the command, in the same order I entered them.
MyComputer$ head sites.txt
Chr1 1074 C A,T 00010
Chr1 1075 TC T 11001
Chr1 6324 T TA 11111
Chr1 12997 CT C,CTT 00010
Chr1 88879 C G 11101
Chr1 88937 G A 11101
Chr1 104023 TA T 11110
Chr1 183752 CA CAA,C 11000
Chr1 195520 A AC 11111
Chr1 214644 A AT 11111
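A minimal sketch of that interpretation of the fifth column, assuming each character of the mask corresponds to one input file in the order given to bcftools isec. The function name decode_sites and the short sample labels are hypothetical, chosen only for illustration; they are not part of bcftools.

```shell
# decode_sites: hypothetical helper, not part of bcftools.
# Reads sites.txt-style lines (CHROM POS REF ALT MASK) on stdin and prints,
# for each site, the labels whose position in MASK holds a 1.
# Labels are passed as arguments in the same order as the files given to isec.
# Assumption: labels contain no whitespace.
decode_sites() {
  awk -v names="$*" '
    BEGIN { n = split(names, name, " ") }
    {
      out = ""
      for (i = 1; i <= n; i++) {
        if (substr($5, i, 1) == "1") {
          sep = (out == "" ? "" : ",")
          out = out sep name[i]
        }
      }
      if (out == "") out = "none"
      print $1, $2, out
    }'
}

# Usage (labels follow the order of the isec command above):
#   decode_sites CL10 CL11 CL12 CL8 CL9 < ./overlaps/sites.txt
```

Under this reading, the first row (mask 00010) would mean the site was found only in the fourth file, CL8_PASS.vcf.gz.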
I then used bcftools query to summarise the VCF file containing all the variants across all samples, after again filtering for "high-confidence" variants:
bcftools view -f PASS ./variants.vcf.gz | bcftools query -f '%CHROM\t%POS\t%REF\t%ALT\t%TYPE[\t%GT]\n'
This creates output which looks like this:
MyComputer$ head variantPassSummary.tsv
Chr1 1074 C A,T SNP 0/2 0/1 0/0 0/0 0/0
Chr1 1075 TC T INDEL 0/1 0/1 0/1 0/1 0/1
Chr1 6324 T TA INDEL 1/1 1/1 1/1 1/1 1/1
Chr1 12997 CT C,CTT INDEL 0/1 0/1 0/0 0/0 0/0
Chr1 88879 C G SNP 1/1 0/0 1/1 1/1 1/1
Chr1 88937 G A SNP 1/1 0/0 1/1 1/1 1/1
Chr1 104023 TA T INDEL 0/1 0/1 0/1 0/0 0/1
Chr1 183752 CA CAA,C INDEL 0/1 0/1 0/2 0/0 0/0
Chr1 195520 A AC INDEL 1/1 1/1 1/1 1/1 1/1
Chr1 214644 A AT INDEL 1/1 1/1 1/1 1/1 1/1
As you can see, the two outputs do not appear to agree. For instance, in the first row the first two samples are heterozygous in the bcftools query output, whereas in the bcftools isec sites.txt file only the fourth sample has this variant. There seem to be discrepancies in every row except those that are homozygous for the ALT allele in all samples.
The VCF file containing all the samples lists them in the same order as I entered them into bcftools isec.
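To make the row-by-row comparison concrete, here is a rough sketch that turns the genotype columns of the bcftools query output into a presence/absence mask in the same layout as sites.txt. The function name gt_to_mask is hypothetical, and the sketch assumes that any genotype carrying a non-reference allele counts as "present", that 0/0 and missing calls count as "absent", and that sites.txt is tab-delimited.

```shell
# gt_to_mask: hypothetical helper, not part of bcftools.
# Input:  CHROM POS REF ALT TYPE GT1 ... GTn (the bcftools query format above)
# Output: CHROM POS REF ALT MASK, tab-separated, one mask character per sample.
# Assumption: a sample is "present" (1) if its genotype has any non-reference
# allele; 0/0, 0|0 and missing genotypes such as ./. are "absent" (0).
gt_to_mask() {
  awk '
    BEGIN { OFS = "\t" }
    {
      mask = ""
      for (i = 6; i <= NF; i++) {
        gt = $i
        gsub(/[0.\/|]/, "", gt)   # drop ref alleles, missing calls, separators
        bit = (gt == "" ? "0" : "1")
        mask = mask bit
      }
      print $1, $2, $3, $4, mask
    }'
}

# Usage:
#   gt_to_mask < variantPassSummary.tsv > queryMasks.tsv
#   diff queryMasks.tsv ./overlaps/sites.txt
```

For the first row this gives 11000 from the multi-sample VCF, against 00010 in sites.txt, which is the kind of mismatch described above.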
Why is this occurring and what am I misunderstanding?