Filtering Vcf File
12.4 years ago
bioinfo ▴ 840

I was wondering how to filter a VCF file based on a few criteria (DP > 10, MQ > 30 and QD > 20, or GT = "1/1", etc.). I'm planning to use a simple command on the command line to extract the matching records and create a new, filtered VCF file. I also want to keep the 20 lines of VCF header in the new file. I can do it with Perl, but is there an easier way? Last time I extracted the information I needed from the VCF file using vcftools, but I couldn't get a filtered VCF file out of it.

My command

vcftools --vcf GMM_homo.vcf --depth --FILTER-summary --TsTv-by-count --site-mean-depth --SNPdensity 1000 --site-pi --minQ 30 --min-meanDP 5 --out homo_GMM

vcftools vcf snp indel • 52k views

I just tried

egrep '^#|"GT =1/1" | "DP>10","MQ>30"' my.vcf > filtered.vcf

Didn't work though.
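
The egrep attempt fails because grep only matches literal text patterns; it cannot compare numbers such as DP > 10. Below is a minimal awk sketch of the same idea, assuming DP, MQ and QD appear as KEY=value entries in the INFO column (column 8) of a tab-delimited VCF. The file demo.vcf and its values are made up purely for illustration, and a missing key is treated as 0.

```shell
# Build a tiny, made-up VCF for demonstration (not real data).
printf '##fileformat=VCFv4.1\n' > demo.vcf
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n' >> demo.vcf
printf 'chr1\t100\t.\tA\tG\t50\tPASS\tDP=25;MQ=60;QD=22.5\n' >> demo.vcf
printf 'chr1\t200\t.\tC\tT\t50\tPASS\tDP=8;MQ=60;QD=30\n' >> demo.vcf
printf 'chr1\t300\t.\tG\tA\t50\tPASS\tDP=40;MQ=20;QD=25\n' >> demo.vcf

# Keep every header line; keep a record only if DP > 10, MQ > 30 and QD > 20.
awk -F'\t' '
  /^#/ { print; next }                 # header lines pass through unchanged
  {
    split("", info)                    # portable way to clear the array
    n = split($8, kv, ";")             # INFO is a semicolon-separated list
    for (i = 1; i <= n; i++) {
      eq = index(kv[i], "=")
      if (eq > 0) info[substr(kv[i], 1, eq - 1)] = substr(kv[i], eq + 1)
    }
    # "+ 0" forces numeric comparison; absent keys evaluate to 0 and fail
    if (info["DP"] + 0 > 10 && info["MQ"] + 0 > 30 && info["QD"] + 0 > 20) print
  }
' demo.vcf > filtered.vcf
```

This keeps the header intact, but unlike a dedicated VCF tool it does no type checking against the header, so treat it as a quick approximation.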


I need to filter my vcf file to include variants with at least 30 individuals in each of the possible groups: major allele homozygote, heterozygote, and minor allele homozygotes; would be grateful for any input. Thanks!


Please ask this as a new question.

12.0 years ago
Erik Garrison ★ 2.4k

You can do exactly this with vcffilter in vcflib!

Here's how to select all variants with depth greater than 10, mapping quality greater than 30, and QD greater than 20:

vcffilter -f "DP > 10 & MQ > 30 & QD > 20" file.vcf >filtered.vcf

Now, to select only variants with homozygotes, you can strip every genotype that's not homozygous, fix up the file's AC and AF fields using the genotypes with vcffixup, and then remove all the AC = 0 sites (again, using vcffilter).

cat filtered.vcf | vcffilter -g "GT = 1/1" | vcffixup - | vcffilter -f "AC > 0" >results.vcf
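
For readers without vcflib at hand, here is a rough awk approximation of the site-level outcome only: it keeps sites where at least one sample has the genotype 1/1. It is a sketch under assumptions, not a substitute for the pipeline above: it assumes GT is the first FORMAT key and that sample columns start at column 10, it does not strip the other genotypes, and it does not recompute AC or AF. The file gt_demo.vcf is made-up demonstration data.

```shell
# Build a tiny, made-up two-sample VCF for demonstration (not real data).
printf '##fileformat=VCFv4.1\n' > gt_demo.vcf
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\ts1\ts2\n' >> gt_demo.vcf
printf 'chr1\t100\t.\tA\tG\t50\tPASS\tDP=25\tGT:DP\t0/1:12\t1/1:13\n' >> gt_demo.vcf
printf 'chr1\t200\t.\tC\tT\t50\tPASS\tDP=30\tGT:DP\t0/1:15\t0/0:15\n' >> gt_demo.vcf

# Keep header lines, plus any site where some sample is homozygous 1/1.
awk -F'\t' '
  /^#/ { print; next }
  {
    for (i = 10; i <= NF; i++) {       # sample columns start at column 10
      split($i, f, ":")                # assumption: GT is the first subfield
      if (f[1] == "1/1") { print; next }
    }
  }
' gt_demo.vcf > hom_sites.vcf
```

Phased genotypes such as 1|1 would need an extra comparison in the same condition.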

The expression language is clunky (you have to put spaces in between the tokens, and parenthetical expressions also have to have spaces). There is also no != symbol, but as a workaround you can do ! ( expression ).

For instance, to pick up non-homozygous genotypes, you'd use:

vcffilter -g "! ( GT = 1/1 )"

I'd like to fix some of these things (and also add regex matching for strings), but thus far it more than does the job for quick filtering operations: I can do virtually any kind of filtering from the command line without having to write a custom script.

These are the supported operations: > < = | & !, and symbols: ( ). Strings are interpreted literally. There is some type checking using the VCF header, so you have to have a valid VCF file. The output is a valid VCF file, so you can stream the filter results into another filtering operation.


Note that this will work for any values in the INFO field or per-sample fields.


Does vcffilter -f work with MuTect VCF output? I tried it, but it does not seem to work. The MuTect VCF output has a FILTER column, and I want to keep only the variants whose value in that column is PASS. Ideally it would look like this:

vcffilter -f "FILTER = PASS" file.vcf > filt_out.vcf

But this does not seem to work. Can anyone tell me what I am getting wrong?


Problem solved; it works well with the egrep command. Thanks.


Hi, I have a MuTect VCF file with the same FILTER column and PASS value. I tried to run the vcffilter command, but as you said, it does not work. I saw that you solved the problem with grep. Could you please give me more information? Thanks.
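
The exact grep command was never posted in this thread, so the following is only one possible approach. FILTER is the seventh fixed column of a VCF record rather than an INFO key, and vcffilter -f is described above as operating on INFO and per-sample values, which may be why "FILTER = PASS" has no effect. A minimal awk sketch, using a made-up file mutect_demo.vcf for illustration:

```shell
# Build a tiny, made-up VCF for demonstration (not real MuTect output).
printf '##fileformat=VCFv4.1\n' > mutect_demo.vcf
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n' >> mutect_demo.vcf
printf 'chr1\t100\t.\tA\tG\t.\tPASS\tDP=25\n' >> mutect_demo.vcf
printf 'chr1\t200\t.\tC\tT\t.\tREJECT\tDP=30\n' >> mutect_demo.vcf

# Keep header lines and records whose FILTER column (7) is exactly PASS.
awk -F'\t' '/^#/ || $7 == "PASS"' mutect_demo.vcf > filt_out.vcf
```

Comparing column 7 exactly avoids the false matches a plain `grep PASS` could produce when the string PASS appears elsewhere on a line.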


I have a problem with vcffilter. When I use it, it removes the variant annotation info (Format: Allele|Consequence|IMPACT|SYMBOL|Gene|Feature_type| ...). Here is my command:

vcffilter -k -f "( TYPE = ins | TYPE = del ) & FDP > 10 & HRUN < 6" -f "QUAL > 20" -g "FAO > 4 & GQ > 5" file.vcf | vcf-annotate --fill-AC-AN | vcffilter -f "AC > 0" > file.vcf.indelfilter.vcf

Any idea where the mistake is and how to fix it?


How can I make sure that only the specified field is filtered, without the information in the rest of the columns being removed from the VCF?

Thanks, Priya

12.4 years ago

2021: my answer is deprecated.

I wrote some tools to extract the fields from INFO and FORMAT. See: https://code.google.com/p/variationtoolkit/wiki/ExtractInfo and https://code.google.com/p/variationtoolkit/wiki/ExtractFormat

$ cat data.vcf.gz |\
   extractformat -t GT |\
   awk -F '\t' '($11=="1/1")' |\
   extractinfo -t DP |\
   awk -F '\t' '(int($12)>10)'

3.5 years later: this is wrong. Just filter the VCF using https://github.com/lindenb/jvarkit/wiki/VCFFilterJS or extract the fields using GATK VariantsToTable.


@Pierre Lindenbaum

Can we convert the fpfilter output file (fpfilter filters VarScan output for false positives) into VCF 4.0 format? I tried vcf-annotate, but to no avail. I also tried to write a script, but that did not help either. Do you know of any custom tool designed for this?

12.0 years ago

SnpSift, a utility associated with SnpEff, has several options for filtering VCF files and for converting them to tab-delimited text.

12.0 years ago
Adam ★ 1.0k

Don't you just need to add --recode to your command?
