I want to remove sites from a VCF file with an overabundance or an underabundance of heterozygous genotypes. I have 8940 markers to start with:
bcftools view unfiltered.bcf.gz -H | wc -l # 8940
When I exclude sites with a frequency of hets > 0.9, this number goes down a bit:
bcftools filter unfiltered.bcf.gz -e 'F_PASS(GT="het") > 0.9' | bcftools view -H | wc -l #8905
When I exclude sites with a frequency of hets < 0.1, suddenly, almost every marker is removed:
bcftools filter unfiltered.bcf.gz -e 'F_PASS(GT="het") < 0.1' | bcftools view -H | wc -l #9
This does not reflect the actual distribution of heterozygous frequencies in this population (validated using bcftools stats
). Indeed, when I perform what to my mind is an equivalent command (i.e., include only sites with a frequency of hets >= 0.1) I get a more reasonable reduction:
bcftools filter unfiltered.bcf.gz -i 'F_PASS(GT="het") >= 0.1' | bcftools view -H | wc -l #8909
Any explanation for the difference in output between these later two commands would be appreciated.
You could try using bcftools query to find the
%N_PASS(GT="het")
per site. It's not listed, but maybe%F_PASS
works too. That way, you'd have a list of F_PASS that you can compare sites that pass/fail filters to and see what's going on.Thanks for the tip. Indeed, this just kind of makes things more confusing:
Returns a long list where it is clear that almost every site has 60-80 samples that pass the expression. However:
Still does not work (setting the hard threshold of 16 in this case because 0.1 * sample size = 15.9). So
bcftools
is definitely parsingN_PASS(GT="het")
correctly, but somehow when based to the-e
option offilter
, it doesn't work....I daresay this warrants an issue on github, or at least an email to their mailing list (if one exists). Check with
-i N_PASS(GT="het") >=16
and if that returns more rows, something is quite wrong.Indeed:
Returns 8909 sites. I opened an in issue on Github and if I get a response I'll give an update here. Thanks for thinking this through.
No problem. Could missing genotypes have something to do with this? Still feels intuitively weird.
There is definitely are missing genotypes (encoded with
.
characters in theFMT/GT
field, but I'm not sure why that would only create a problem fore
and noti
. The compressed .vcf is here: https://www.dropbox.com/s/ypkh861qkimz6jc/DMxM6_F2.vcf.bgz?dl=0, and it's tabix-generated index here: https://www.dropbox.com/s/0d5sccem1qe9m6s/DMxM6_F2.vcf.bgz.tbi?dl=0, if you want to take a look.