Dear all,
I think I need some ground-up assistance with filtering VCF files. I have generated a VCF file with samtools + bcftools after an initial alignment with bowtie2 and from what I can tell this VCF file looks as expected. Now I want to filter it in several ways and am struggling to figure out how to do it. The following is what I need to do:
- DP >= 10
- DP_STRAND >= 10
- QUAL >= 50
- MQ >= 30
- Not on a repetitive region
- Not less than 9nt away from a gap
Unfortunately I'm rather limited in the software I can use; I have bcftools and vcftools, as well as the ability to run any Python or Perl script, so any answers that use those would be ideal. As I'm not using my own computer I can't freely install other things, although if there's something I absolutely need I can probably ask for it.
Thanks in advance.
EDIT: Just saw your edit; thanks I will take it on board :)
Hi, thank you very much for your answer, it's very informative. I have now used BCFTools to filter out reads with DP>=10 and/or MQ>=30 so I have made progress, however I'm a bit stumped about how to filter according to QUAL and DP_STRAND, as these do not appear in the vcf file header and BCFTools naturally cannot then refer to them. I apologise if this a very basic/stupid question, but do these values perhaps have different names depending on how the file was produced?
The file was produced using bcftools-1.2 and samtools-1.2, command line:
The filtering I've done so far (in case I'm somehow doing it wrong) was:
Depending on what you effectively want to test with DB_STRAND, you could take a look at DP4. For example to filter variants with more than 10 reads on forward AND on reverse strand, you could do:
You can filter for all fields available. For example,
Also, check the additional fields that can be reported by samtools during pileup (-t parameter). In particular, the DPR (samtools 1.2) or the AD (samtools >= 1.3) are of interest! In particular, the AD is important if you encounter multi-allelic variants...
This did not work for me. After applying all the filter criteria to get the exact set of variants for exclusion, using
bedtools intersect -a my_variants.vcf -b excluded_variants.vcf
ended up pulling back in variants that I had already excluded.