vcf filter for FORMAT argument (DP/GQ/GQX > 30) multiple (380) vcf file
0
0
Entering edit mode
4.7 years ago
IndyDNA ▴ 10

Hi everybody,

I need to filter for DP and GQ or GQX for genotypes in FORMAT field, within several (380) vcf files, everyone belonging to a different patient. Then I have to merge them in a single file to analyze data. So for me it's the same doing it before merging or on the merged file. I try with vcftool

 --gzvcf *.vcf.gz --minGQX 30 --minDP 30 --recode --recode-INFO-all --out fltD30GQ30

it work with one file, but I need to update for every (379) files for both input and output options. It does not work for the merged vcf (multisample).

have anyone hints? even with different tools or some cycles

I try with vcffilter
but I did not succed in installing it. command "vcffilter" not found

thank you all in advance

Diego

vcffilter vcftools sequencing • 3.6k views
ADD COMMENT
0
Entering edit mode

Have you tried bcftools?

Also you say

It does not work for the merged vcf

Can you elaborate on this? What do you mean by "does not work", and how do you know it does not work?

ADD REPLY
0
Entering edit mode

Thank you RamRS,

I tried bcftools. Which I use for merging vcfs. I use

bcftools filter *.vcf.gz -i 'FORMAT/DP>30 & FORMAT/GQ>30' -Ov -o fltD3Q3

As vcftools it worked with one, but not to all. With *.vcf in input file, as I don't know how to give multiple output names, it results in one vcf with all header and the sample-name and, unfortunately, with no data nor variants. When I use bcftools filter on a merged vcf some genotype with DP<30 and GQX<30 are still present. Furthermore bcftols did not write how many variants were filtered

About vcftools on merged vcf, as bcftools, file is created right but genotype with DP<30 or GQ<30 are retained. Those are the warning of vcftools filtering merged vcf

Warning: Expected at least 2 parts in FORMAT entry: ID=GQX,Number=1,Type=Integer,Description="Minimum of {Genotype quality assuming variant position,Genotype quality assuming non-variant position}">
Warning: Expected at least 2 parts in FORMAT entry: ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification">
Warning: Expected at least 2 parts in FORMAT entry: ID=VF,Number=1,Type=Float,Description="Variant Frequency, the ratio of the sum of the called variant depth to the total depth">
Warning: Expected at least 2 parts in INFO entry: ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed">
Warning: Expected at least 2 parts in INFO entry: ID=AC,Number=A,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed">
Warning: Expected at least 2 parts in INFO entry: ID=AF,Number=A,Type=Float,Description="Allele Frequency, for each ALT allele, in the same order as listed">
Warning: Expected at least 2 parts in INFO entry: ID=AF,Number=A,Type=Float,Description="Allele Frequency, for each ALT allele, in the same order as listed">
Warning: Expected at least 2 parts in INFO entry: ID=BLOCKAVG_min30p3a,Number=0,Type=Flag,Description="Non-variant site block. All sites in a block are constrained to be non-variant, have the same filter value, and have all sample values in range [x,y] , y <= max(x+3,(x*(1+0.3))). All printed site block sample values are the minimum observed in the region spanned by the block">
Warning: Expected at least 2 parts in INFO entry: ID=BLOCKAVG_min30p3a,Number=0,Type=Flag,Description="Non-variant site block. All sites in a block are constrained to be non-variant, have the same filter value, and have all sample values in range [x,y] , y <= max(x+3,(x*(1+0.3))). All printed site block sample values are the minimum observed in the region spanned by the block">
Warning: Expected at least 2 parts in INFO entry: ID=BLOCKAVG_min30p3a,Number=0,Type=Flag,Description="Non-variant site block. All sites in a block are constrained to be non-variant, have the same filter value, and have all sample values in range [x,y] , y <= max(x+3,(x*(1+0.3))). All printed site block sample values are the minimum observed in the region spanned by the block">
Warning: Expected at least 2 parts in INFO entry: ID=BLOCKAVG_min30p3a,Number=0,Type=Flag,Description="Non-variant site block. All sites in a block are constrained to be non-variant, have the same filter value, and have all sample values in range [x,y] , y <= max(x+3,(x*(1+0.3))). All printed site block sample values are the minimum observed in the region spanned by the block">
After filtering, kept 382 out of 382 Individuals
Outputting VCF file...
After filtering, kept 5635 out of a possible 5635 Sites
ADD REPLY
1
Entering edit mode

You'll need to either merge the VCF files before running the bcftools filter operation or run bcftools once per file. When you use shell globs (such as the *.vcf), the shell will first expand them before anything else happens.

The way to give "multiple output filenames" is to run bcftools once per input file (using a loop or xargs or GNU parallel). I'd recommend using a loop as it the easiest of the options. Of course, if the loop was the problem in the first place, you don't need to switch to bcftools if you prefer vcftools.

ADD REPLY
1
Entering edit mode

Always Thank You RamRS, I don't know if I am right, I tried with loop:

    for i in *.vcf.gz; 
    do
bcftools filter -i 'FMT/DP>=30 & FMT/GQX>=30' -Ov -o fDGQX3.vcf $i;
done

it result in saving a single vcf file, with name in -o string containing data of only last sample processed. Without giving -o option it write all data on terminal.

I FOUND a WAY!

for i in *.vcf.gz; do echo "filtering $i"; 
bcftools filter -i 'FMT/DP>=30 & FMT/GQX>=30' -Ov -o "${i//.vcf.gz}"_out.vcf $i; 
done
ADD REPLY
0
Entering edit mode

That's wonderful! Congratulations. You had to build an output file for each input file, and that is indeed what you've done. Plus, you figured out shell parameter expansion syntax, which will be quite useful going forward.

I don't understand the ${i//.vcf.gz} though - did you mean to remove .vcf.gz from $i? If so, while what you're using works, it's better to use the correct syntax, which puts the .vcf.gz at the search spot and not the replace spot in ${string/search/replace} syntax.

ADD REPLY
0
Entering edit mode

Yeah! each day some new achievement! =D

I don't understand the ${i//.vcf.gz} though - did you mean to remove .vcf.gz from $i? <

I don't know exactly what do those I write -where can I find some info?- But without it filtered vcf have messy names like: "sample-number.vcf.gz_out.vcf"

ADD REPLY
1
Entering edit mode

This is the page I use as my parameter expansion go-to: http://wiki-dev.bash-hackers.org/syntax/pe

ADD REPLY

Login before adding your answer.

Traffic: 1612 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6