Remove not polymorphic sites from VCF
4
1
Entering edit mode
5.8 years ago

I have a VCF file and I want to remove sites that are not polymorphic sites i.e. sites where all individuals are REF/REF or ALT/ALT. Is there a quick way to do this using bcftools or vcftools?

Thanks.

vcf genome • 9.7k views
ADD COMMENT
5
Entering edit mode
5.8 years ago

Now I hopefully understand your goal. This will remove side where all samples are either hom ref or hom alt:

$ bcftools view -e 'COUNT(GT="AA")=N_SAMPLES || COUNT(GT="RR")=N_SAMPLES' input.vcf

Have a look at the manual to explore more ways to filter vcf files using bcftools.

I've deleted my old answer, because it hasn't answer your question.

ADD COMMENT
0
Entering edit mode

Thanks! Yes, I think that answers my question. I got this error message though: the tag "INFO/COUNT" is not defined in the VCF header. Do you know how to add this info on my vcf?

ADD REPLY
1
Entering edit mode

I think you're using an outdated version of bcftools. The method COUNT was introduced in v1.7. The current version is v1.9.

Which version are you using? Please upgrade.

ADD REPLY
0
Entering edit mode

Using v1.9 solved it - Thanks!

ADD REPLY
0
Entering edit mode

@finswimmer have a quick question, will bcftools get the COUNT of the AA or RR genotypes based on the INFO field? Or will it compute the COUNT for all sites? I am applied this filter after merging my dataset with another dataset and I think I am getting weird results.

ADD REPLY
0
Entering edit mode

For each line in the vcf file bcftools will have a look at the genotype of each sample. In the INFO column there is no information about the counts.

Why do you think you are getting a weird result? As always, a small example dataset which shows your problem would be helpful.

ADD REPLY
0
Entering edit mode

The command will not work if there are missing genotypes in the given position.

ADD REPLY
5
Entering edit mode
5.1 years ago
amy.roberts ▴ 50

You can also use the -c function in view

-c, --min-ac INT[:nref|:alt1|:minor|:major|:'nonmajor']
minimum allele count (INFO/AC) of sites to be printed. Specifying the type of allele is optional and can be set to non-reference (nref, the default), 1st alternate (alt1), the least frequent (minor), the most frequent (major) or sum of all but the most frequent (nonmajor) alleles.

By setting -c 1, you will only keep sites with at least one nonref allele (=polymorphic in your data)

bcftools view -c 1 input.vcf.gz -o output.vcf.gz -Oz
ADD COMMENT
0
Entering edit mode

I'm wondering how can i filter out both the monomorphic reference sites and sites which are monomorphic for the alternative allele?

ADD REPLY
2
Entering edit mode
5.8 years ago

using vcffilterjdk: http://lindenb.github.io/jvarkit/VcfFilterJdk.html

 java -jar dist/vcffilterjdk.jar -e 'return !(variant.getGenotypes().stream().allMatch(G->G.isHomVar()) || variant.getGenotypes().stream().allMatch(G->G.isHomRef()));' input.vcf
ADD COMMENT

Login before adding your answer.

Traffic: 2718 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6