Question

Filtering out samples by FORMAT attributes using bcftools

4

Entering edit mode

3.8 years ago

scottbrainard ▴ 60

The -e and -i options of the bcftools filter command appear, by default, to only allow for including or excluding _sites_. E.g., -e 'FMT/DP < 10' removes sites where any sample has DP < 10, and -e 'MEAN(FMT/DP) < 10' removes sites where average depth across samples is < 10.

I would like to perform effectively similar filtering commands, but in a way that includes or excludes samples, instead of sites. E.g., exclude all _samples_ that have DP < 10 at any _site_, or exclude all samples with an average depth across all sites < 10, or exclude all samples that have a genotype quality of < 20 at 10% of genotyped sites.

Such commands can very easily be written to apply to sites, but there does not seem to be a straightforward way of applying such filtering expressions to samples in bcftools.

One option mentioned in the manual is to use the double || operator, as described in the man page. As described in my comment on a Github issue, however, this does not seem to work as a filtering operation for including or excluding entire columns in a VCF file, but rather only for selecting more or less individual records (e.g., as illustrated in the example on this tutorial page). In addition, it would seem limited in functionality (e.g., functions such as MEAN(FMT/DP) < 10 would still be evaluated across samples, not sites, when combined with a || operator).

The only other method mentioned in the manual is to use the SMPL or s prefix to functions which evaluate FORMAT fields in expressions passed to filter. This also does not appear to work as I would expect. I'm not sure I understand what it _is_ doing, but it definitely _isn't_ removing columns from the VCF file. Instead it seems to be using as a metric calculated across sites on a per sample basis as a filtering criteria applied to sites (but I could just be confused):

nSites_unfiltered=$(bcftools view -H unfiltered.vcf.gz | wc -l) 
nSamples_unfiltered=$(bcftools query unfiltered.bcf.gz -l | wc -l)

nSites_DPfilter=$(bcftools filter unfiltered.bcf.gz -e 'sMEAN(FORMAT/DP)<5' | bcftools view -H | wc -l)
nSamples_DPfilter=$(bcftools filter unfiltered.bcf.gz -e 'sMEAN(FORMAT/DP)<5' | bcftools query -l | wc -l)

echo "scale=5; $nSites_DPfilter / $nSites_unfiltered" | bc #0.17349 - sites are removed, even though the filter is being applied to samples
echo "scale=5; $nSamples_DPfilter / $nSamples_unfiltered" | bc #1.00000 - no samples removed, despite, again, the filter being applied to samples

nSites_normalDPFilter=$(bcftools filter unfiltered.bcf.gz -e 'MEAN(FMT/DP) < 5' | bcftools view -H | wc -l)
echo "scale=5; $nSites_normalDPFilter / $nSites_unfiltered" | bc #0.83121 - fewer sites removed than the sMEAN filter

The .bcf.gz passed to these commands is here: https://www.dropbox.com/s/fwj1snabsb763sn/unfiltered.bcf.gz?dl=0 Its .csi index file is here: https://www.dropbox.com/s/jzdq6y91nkencmy/unfiltered.bcf.gz.csi?dl=0

In any event, the only option I have been able to come up with that actually excludes or includes entire sample IDs is to extract summary statistics using bcftools stats, export the desired statistics from the PSC, Per-sample counts table, and perform my own exploratory analysis outside of bcftools. Then I can use bcftools view -S to explicitly remove samples by their ID, which I have identified manually as failing some thresholding criteria. This also is obviously cumbersome, and only works for FMT fields that are included in stats (e.g., not GQ or PL).

I am not sure if I am missing some basic functionality of bcftools, or if this is more of a feature request. Any information would be appreciated.

Thanks!

vcf SNP filtering bcftools gbs • 4.7k views

ADD COMMENT • link updated 2.8 years ago by steve ★ 3.5k • written 3.8 years ago by scottbrainard ▴ 60

score 0 · Answer 1 · 2022-02-02

I found it easier to do this kind of filtering by first splitting the multi-sample .vcf into multiple single-sample .vcf's with

bcftools +split input.vcf --output-type v --output output_dir/

https://samtools.github.io/bcftools/bcftools.html#plugin

Then use a simple command to just check how many variants match the criteria

$ bcftools query -f "%CHROM\n" -i 'DP>10' output_dir/Sample1.vcf | wc -l
0

Wrap this in a for loop over all the sample ID's in the original VCF and you can figure out which samples have any variants that match your criteria. Then implement the method you already mentioned with bcftools view -S to explictly select which variants to keep.

Its not elegant, but its relatively simple to script up and should work with any INFO or FORMAT tag.