I have a vcf where there are many calls for SNPs that are the same across all my samples in contrast only to the reference, but I only want to keep SNPs that are variable within my set (I don't care about the reference). How do i get rid of those with vcf tools?
Also, can you give us examples for sites you wish to pick and sites you wish to ignore, including the logic for picking them? The current description does not help us understand your requirement well.
I think I get what you want. For example, you have 6 yeast strains, 1 is a control and the other 5 represent conditions, and you're interested in how they all differ from each other or from the control. If they all differ equally from the reference (including the control) this is not an interesting variant because it is present in the control, and was thus already there when the experiment started. You want to ignore positions in your VCF file that are all equally different from the reference. One way to do this (I'm not sure it's the best way), is to use gatk SelectVariants and create a compound select statement using getGenotype("sampleID1").isHomVar() && getGenotype("sampleID2").isHomVar() etc. to select positions that are variant across all your samples, and put these into a file. You can then use this file with SelectVariants a second time on your original VCF file to select positions with the --discordance argument. This will grab the locations which are not uniformly variant. Seems complicated, and like it should be simpler, but that's one way.
I understand your question as asking how to remove all sites where ALL samples carry only the alternate allele (e.g. sites where all individuals have the genotype "1/1", for example).
You can remove these sites with vcftools with a command similar to this:
bcftools filter -e "INFO/AF=1.0" -O z -o output.vcf.gz input.vcf
You may want to ensure you don't throw out sites where one or two samples are called as 1/1 and the rest are no calls, in which case you could specify a minimum allele count. This is probably easiest to do with bcftools. For example, only exclude a variant if it has an AF of 1.0 AND at least 10 called alleles (e.g. 5 diploid genotypes):
What have you tried? Have you looked at the bcftools and vcftools manual pages?
Also, can you give us examples for sites you wish to pick and sites you wish to ignore, including the logic for picking them? The current description does not help us understand your requirement well.
I think I get what you want. For example, you have 6 yeast strains, 1 is a control and the other 5 represent conditions, and you're interested in how they all differ from each other or from the control. If they all differ equally from the reference (including the control) this is not an interesting variant because it is present in the control, and was thus already there when the experiment started. You want to ignore positions in your VCF file that are all equally different from the reference. One way to do this (I'm not sure it's the best way), is to use gatk
SelectVariants
and create a compound select statement usinggetGenotype("sampleID1").isHomVar() && getGenotype("sampleID2").isHomVar()
etc. to select positions that are variant across all your samples, and put these into a file. You can then use this file withSelectVariants
a second time on your original VCF file to select positions with the--discordance
argument. This will grab the locations which are not uniformly variant. Seems complicated, and like it should be simpler, but that's one way.