VCF Filtering by Multiple Parameters
1
1
Entering edit mode
8.1 years ago
rc16955 ▴ 90

Dear all,

I think I need some ground-up assistance with filtering VCF files. I have generated a VCF file with samtools + bcftools after an initial alignment with bowtie2 and from what I can tell this VCF file looks as expected. Now I want to filter it in several ways and am struggling to figure out how to do it. The following is what I need to do:

  • DP >= 10
  • DP_STRAND >= 10
  • QUAL >= 50
  • MQ >= 30
  • Not on a repetitive region
  • Not less than 9nt away from a gap

Unfortunately I'm rather limited in the software I can use; I have bcftools and vcftools, as well as the ability to run any Python or Perl script, so any answers that use those would be ideal. As I'm not using my own computer I can't freely install other things, although if there's something I absolutely need I can probably ask for it.

Thanks in advance.

snp • 7.8k views
ADD COMMENT
3
Entering edit mode
8.1 years ago

For the first four criteria, I recommend to use the BCFTools filter sub-command. To identify repetitive regions, you first need to have a database of these regions (preferably as BED file). Afterwards, you can employ BEDTools intersect (check out the -v parameter) to remove variants from the VCF or use BCFTools annotate to add the information to the VCF info field.

The same procedure can be used for the filtering for variants close to gaps: first identify them and write them to a BED file (you can potentially add the 9nt margins by using a custom AWK or Perl script). Then use the BED file to remove/annotate the variants.

When working with BED and VCF files at the same time, please keep in mind that VCFs uses 1-based positions while BED files uses 0-based and end-exclusive intervals.

EDIT: I recommend using variant normalization and - for multiple filter criteria - to use pipes, e.g.:

bcftools view my_variants.vcf \
   | bcftools norm -m - \
   | bcftools norm -f $reference_genome \
   | bcftools filter -e $FILTER_CRITERIA_1 \
   | bcftools filter -e $FILTER_CRITERIA_2 \
   | bcftools filter -e $FILTER_CRITERIA_3 \
   | ... \
   | bedtools intersect -v -a - -b $my_repetitive_regions.bed \
   | bedtools intersect -v -a - -b $my_gap_regions.bed \
   | bcftools view -o my_variants.filtered.vcf
ADD COMMENT
0
Entering edit mode

EDIT: Just saw your edit; thanks I will take it on board :)

Hi, thank you very much for your answer, it's very informative. I have now used BCFTools to filter out reads with DP>=10 and/or MQ>=30 so I have made progress, however I'm a bit stumped about how to filter according to QUAL and DP_STRAND, as these do not appear in the vcf file header and BCFTools naturally cannot then refer to them. I apologise if this a very basic/stupid question, but do these values perhaps have different names depending on how the file was produced?

The file was produced using bcftools-1.2 and samtools-1.2, command line:

samtools mpileup -d1000 -uf s_ratti_reference.fa ED132_66_10304_55.sorted.bam | bcftools call -mv > ED132_55.vcf

The filtering I've done so far (in case I'm somehow doing it wrong) was:

bcftools filter -e "INFO/MQ <= 30" ED132_55.vcf > ED132_55_filtered.vcf 
bcftools filter -e "INFO/DP <= 10" ED132_55_filtered.vcf > ED132_55_filtered2.vcf
ADD REPLY
1
Entering edit mode

Depending on what you effectively want to test with DB_STRAND, you could take a look at DP4. For example to filter variants with more than 10 reads on forward AND on reverse strand, you could do:

bcftools filter -e "DP4[0] + DP[2]) <= 10" | bcftools filter -e "DP4[1] + DP[3]) <= 10"

You can filter for all fields available. For example,

bcftools filter -e "QUAL < 50" | bcftools filter -e "FORMAT/AD[1] > 10"

Also, check the additional fields that can be reported by samtools during pileup (-t parameter). In particular, the DPR (samtools 1.2) or the AD (samtools >= 1.3) are of interest! In particular, the AD is important if you encounter multi-allelic variants...

ADD REPLY
0
Entering edit mode

This did not work for me. After applying all the filter criteria to get the exact set of variants for exclusion, using bedtools intersect -a my_variants.vcf -b excluded_variants.vcf ended up pulling back in variants that I had already excluded.

ADD REPLY

Login before adding your answer.

Traffic: 1791 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6