When creating VCF files with bcftools, I can output only the variant positions using the -v option. But when I already have a VCF file with all positions and want to keep only the ones with variation, which is the most appropriate option in vcftools?
I could do
bcftools view -Scvg allpositions.vcf.gz > only_variants.vcf
But re-calling the SNVs and genotypes seems a waste of CPU cycles to me.
I am not sure if it would be better to use any of these options in vcftools:
--non-ref-ac <float>
--max-non-ref-ac <float>
Include only sites with all Non-Reference Allele Counts within the specified range.
Or:
--min-alleles <int>
--max-alleles <int>
Include only sites with a number of alleles within the specified range. For example, to include only bi-allelic sites, one could use:
vcftools --vcf file1.vcf --min-alleles 2 --max-alleles 2
Any help would be appreciated, and sorry if this has been asked before; I could not find an answer to it.
What exactly do you mean by "only variation sites"?
Sites where an alternative allele has been found and the genotype is "0/1" or "1/1". This is for single-sample VCF files. When I have a large multisample VCF, I apply another filter, like 'remove all-homozygote positions', to deal with reference alleles that are in fact variants not seen in our population.
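To make that definition concrete for the single-sample case, here is a minimal sketch of a plain awk filter that keeps the header plus any record whose genotype carries a non-reference allele, without re-calling anything. It is not a vcftools or bcftools feature: the function name filter_variant_sites is made up for illustration, and it assumes a tab-delimited VCF with exactly one sample in column 10 and a GT entry in the FORMAT column.

```shell
# Hypothetical helper (not part of vcftools/bcftools): reads a single-sample
# VCF on stdin and writes the header plus variant records to stdout.
filter_variant_sites() {
  awk -F'\t' '
    /^#/ { print; next }                 # pass header lines through unchanged
    {
      n = split($9, fmt, ":")            # FORMAT keys, e.g. GT:DP
      split($10, smp, ":")               # values for the single sample
      gt = ""
      for (i = 1; i <= n; i++)
        if (fmt[i] == "GT") gt = smp[i]  # locate the genotype field
      if (gt ~ /[1-9]/) print            # keep 0/1, 1/1, 1|0, ...; drop 0/0 and ./.
    }'
}
```

It could then be used on the file from the question as `gzip -dc allpositions.vcf.gz | filter_variant_sites > only_variants.vcf`. A multisample file would need a loop over all sample columns, which this sketch does not attempt.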