I have a VCF file where the FMT/DP
of missing genotypes has been recorded itself as missing data ("."
)
When I filter this file in bcftools
, such that only sites where at least 90% of samples have a depth >= 10, I am returned 8940 sites.
bcftools filter unfiltered.bcf.gz -i 'F_PASS(FMT/DP >= 10) >= 0.9' -Ob -o filtered.bcf.gz
bcftools view filtered.bcf.gz -H | wc -l #8940 sites
When I perform an equivalent filtering operation in VariantAnnotation
in R, I get different results depending on whether I replace this "."
with zeros, or remove missing data from the calculation, and neither value is 8940:
data <- readVcf(file="unfiltered.vcf.bgz")
depth <- geno(data)$DP
# replace NAs with 0
depth[is.na(depth)] <- 0
DP0.1 <- apply(depth,1,quantile,p=0.1)
DP_filter <- (DP0.1>=10)
sum(DP_filter) #8953
# don't replace NAs with 0, but remove from quantile calculation
depth <- geno(data)$DP
DP0.1 <- apply(depth,1,quantile,p=0.1,na.rm=T)
DP_filter <- (DP0.1>=10)
sum(DP_filter) #9196
I am wondering how bcftools
parses the missing values in this case, since it does not conform to either of the options I can simulate in R.
Relatedly, is there a way to edit the FMT/DP
attribute of specific genotypes and/or, change the way bcftools
treats missing data when operating on FMT/DP information? I.e., what I would really like to do is replace the depth of all missing genotypes with 0s.
Thanks!