I have a VCF file and I want to remove sites that are not polymorphic sites i.e. sites where all individuals are REF/REF or ALT/ALT. Is there a quick way to do this using bcftools or vcftools?
Thanks.
I have a VCF file and I want to remove sites that are not polymorphic sites i.e. sites where all individuals are REF/REF or ALT/ALT. Is there a quick way to do this using bcftools or vcftools?
Thanks.
Now I hopefully understand your goal. This will remove side where all samples are either hom ref or hom alt:
$ bcftools view -e 'COUNT(GT="AA")=N_SAMPLES || COUNT(GT="RR")=N_SAMPLES' input.vcf
Have a look at the manual to explore more ways to filter vcf
files using bcftools
.
I've deleted my old answer, because it hasn't answer your question.
You can also use the -c
function in view
-c, --min-ac INT[:nref|:alt1|:minor|:major|:'nonmajor']
minimum allele count (INFO/AC) of sites to be printed. Specifying the type of allele is optional and can be set to non-reference (nref, the default), 1st alternate (alt1), the least frequent (minor), the most frequent (major) or sum of all but the most frequent (nonmajor) alleles.
By setting -c 1
, you will only keep sites with at least one nonref allele (=polymorphic in your data)
bcftools view -c 1 input.vcf.gz -o output.vcf.gz -Oz
using vcffilterjdk: http://lindenb.github.io/jvarkit/VcfFilterJdk.html
java -jar dist/vcffilterjdk.jar -e 'return !(variant.getGenotypes().stream().allMatch(G->G.isHomVar()) || variant.getGenotypes().stream().allMatch(G->G.isHomRef()));' input.vcf
Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Thanks! Yes, I think that answers my question. I got this error message though:
the tag "INFO/COUNT" is not defined in the VCF header
. Do you know how to add this info on my vcf?I think you're using an outdated version of
bcftools
. The methodCOUNT
was introduced in v1.7. The current version is v1.9.Which version are you using? Please upgrade.
Using v1.9 solved it - Thanks!
@finswimmer have a quick question, will bcftools get the COUNT of the AA or RR genotypes based on the INFO field? Or will it compute the COUNT for all sites? I am applied this filter after merging my dataset with another dataset and I think I am getting weird results.
For each line in the vcf file
bcftools
will have a look at the genotype of each sample. In the INFO column there is no information about the counts.Why do you think you are getting a weird result? As always, a small example dataset which shows your problem would be helpful.
The command will not work if there are missing genotypes in the given position.