I have a VCF file that I downloaded from 1000 Genomes. I then filtered it down to 5 samples using this command:
bcftools view --samples-file my5RandomIDs.txt ALL.chr22.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz -o myNewVCF.vcf
So far so good. It is my understanding that, for each sample, 0|0, 0|1, 1|0, and 1|1 show which haplotype carries the variant allele, and that 0|0 means the sample does not have the variant. The problem is that I can find quite a few variants (especially structural variants) that are 0|0 for all of the samples. This does not make sense to me: if none of the samples has the variant, it should not be in the VCF file. You can find a screenshot of the file (filtered to show only structural variants). What is causing this behavior, or did I misunderstand something fundamental?
Many thanks in advance.
While splitting samples, you can use the -c 1 parameter with bcftools view to filter out reference-only lines, that is, sites where none of the selected samples carries a non-reference allele.
Please do not paste screenshots of plain-text content; it is counterproductive. You can copy and paste the content directly here (using the code formatting option shown below), or use a GitHub Gist if the content exceeds the length allowed here.
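To make the -c 1 suggestion concrete: -c is the short form of --min-ac, so adding it to the original command (bcftools view --samples-file my5RandomIDs.txt -c 1 ...) should keep only sites where at least one non-reference allele remains among the selected samples. As far as I understand, bcftools view applies this count after subsetting, but that is worth checking against the manual for your version. The sketch below imitates the filter with plain awk on a made-up three-site VCF body, so it runs without bcftools. The toy records, the sample names S1 and S2, and the functions toy_vcf and keep_nonref are all hypothetical, and it assumes GT is the first FORMAT subfield.

```shell
#!/bin/sh
# Toy VCF body (tab-separated), imagined as already subset to two
# hypothetical samples S1 and S2.
# Columns: CHROM POS ID REF ALT QUAL FILTER INFO FORMAT S1 S2
toy_vcf() {
  printf '22\t100\tsv1\tA\t<DEL>\t.\tPASS\t.\tGT\t0|0\t0|0\n'
  printf '22\t200\tsv2\tC\t<DUP>\t.\tPASS\t.\tGT\t0|1\t0|0\n'
  printf '22\t300\tsv3\tG\tT\t.\tPASS\t.\tGT\t1|1\t1|0\n'
}

# Keep a record only if some sample genotype carries a non-reference allele.
# This mimics the effect of a minimum allele count of 1; it is not bcftools' code.
keep_nonref() {
  awk -F'\t' '
    /^#/ { print; next }              # pass header lines through unchanged
    {
      for (i = 10; i <= NF; i++) {    # sample columns start at field 10
        split($i, f, ":")             # GT assumed to be the first subfield
        if (f[1] ~ /[1-9]/) { print; next }
      }
    }'
}

toy_vcf | keep_nonref
```

The sv1 record is 0|0 in both remaining samples, so it is dropped, while sv2 and sv3 are kept. Subsetting samples removes columns, not rows, which is why a site carried only by the discarded samples still appears until a filter like this is applied.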