I have a very large multisample (N=240) VCF file from a reduced-representation sequencing experiment (similar to RADseq). I want to apply some basic filtering options and I cannot find an easy, straightforward way to do it. I will write down what I am doing, and I would appreciate any ideas that are less time/resource consuming.
Filter low coverage: I want to filter genotypes (not samples, not sites) that have a DP below 5. That is, remove the genotype information for a particular site AND a particular sample that did not reach enough read depth.
Filter high coverage: I want to filter genotypes (not samples, not sites) that have a DP above the 99th percentile within the sample. That is, remove the genotype information for a particular site AND a particular sample that has very high coverage, likely indicating a mapping problem or a repetitive region.
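To make it concrete, this is the kind of change I am after at the genotype level (a toy VCF line with FORMAT GT:DP and two made-up samples; the first genotype has DP=3 and should become a no-call, the second one is kept):
#CHROM  POS   ID  REF  ALT  QUAL  FILTER  INFO  FORMAT  SAMPLE_A  SAMPLE_B
chr1    1234  .   A    G    .     PASS    .     GT:DP   0/1:3     1/1:20    (before)
chr1    1234  .   A    G    .     PASS    .     GT:DP   ./.:3     1/1:20    (after the low-coverage filter)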
I have read a lot of documentation (GATK, Picard FilterVcf, bcftools...) and I am not able to find an easy solution. For the first filter, I can use GATK VariantFiltration in this way:
gatk VariantFiltration -V input.vcf.gz -O output.vcf.gz --genotype-filter-name "LOWDP" --genotype-filter-expression "DP < 5" --set-filtered-genotype-to-no-call true
But how could I do the same for the second filter? Any ideas?
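For what it is worth, the closest bcftools equivalent I could find for this kind of genotype masking is the +setGT plugin. A sketch for the fixed low-DP threshold, assuming the plugin ships with your bcftools installation (-t q selects the genotypes matched by the -i expression, -n . sets them to missing):
bcftools +setGT input.vcf.gz -Oz -o lowdp_masked.vcf.gz -- -t q -n . -i 'FMT/DP<5'
But that still leaves me with the same problem for the second filter, because the threshold is different for every sample.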
In my desperation, I found another ‘solution’, which is more a problem than a solution: split the VCF file per sample using bcftools:
for sample in $(bcftools query -l input.vcf.gz); do
    bcftools view -Oz -s "$sample" input.vcf.gz -o "${sample}.vcf.gz"
done
Then, now that each VCF file contains only one sample, I can easily use bcftools to apply both filters, keeping in each filtered file only the sites that pass the thresholds:
LIST=$(ls *.vcf.gz)
for vcf_file in $LIST; do
    # Extract sample name from file name
    sample_name=$(basename "$vcf_file" .vcf.gz)
    #echo $sample_name
    # First filtering step: exclude sites with coverage below 5
    bcftools view -Oz -i 'FORMAT/DP>=5' "$vcf_file" -o "filter1_${sample_name}.vcf.gz"
    # Second filtering step: exclude sites with coverage above the 99th percentile
    # Calculate the per-sample 99th percentile of DP, ignoring missing values
    percentile_99=$(bcftools query -f '[%DP]\n' "$vcf_file" | \
        grep -v '^\.$' | \
        sort -n | \
        awk '{a[NR]=$1} END{print a[int(NR*0.99)]}')
    #echo $percentile_99
    # Apply the filter with bcftools:
    bcftools view -Oz -i "FORMAT/DP<=${percentile_99}" "filter1_${sample_name}.vcf.gz" -o "filter2_${sample_name}.vcf.gz"
done
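The only shortcut I can think of is at least getting the per-sample 99th percentiles in a single pass over the big file instead of per split file. A rough, untested sketch (per_sample_DP99.txt is just a placeholder name, and the last awk step keeps all non-missing DP values in memory):
bcftools query -f '[%SAMPLE\t%DP\n]' input.vcf.gz | \
    awk -F'\t' '$2 != "."' | \
    sort -k1,1 -k2,2n | \
    awk -F'\t' '{dp[$1, ++n[$1]] = $2} END {for (s in n) print s "\t" dp[s, int(n[s] * 0.99)]}' > per_sample_DP99.txt
But I would still have to split the file anyway to apply a different DP cut-off to each sample, so it does not really solve the problem.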
The problem with this approach is that splitting the VCF file takes a very long time (5 days and counting, so far). Then I will end up with 240 ‘small’ VCF files (1.2 GB each) that still have to go through the filtering loop, which will take who knows how long.
Since this is a very basic filtering step, has anyone encountered the same problem and applied a better solution? As an added note, I have been advised not to use custom scripts to parse VCF files, and not to use vcftools since it is no longer maintained. Any thoughts on this too?
Thanks a lot!
You're trying to treat VCF, a text format designed for transmitting information, as an entire analysis platform. You need to get this file into a typical data-science-capable environment such as Python or R, where these row-based analyses will be much easier.
I was trying to avoid this since I am not sure I will be capable of scripting something myself, but it makes more sense for sure! Thanks!
Use a workflow manager to parallelize.
Furthermore, as Jeremy Leipzig said, you should use a custom program (Python, etc.) to run this kind of job.
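For example, even plain GNU parallel (not a full workflow manager, just a sketch assuming it is installed; -j 16 is an arbitrary job count) would turn the per-sample split into concurrent jobs, and the filtering loop can be fanned out the same way:
bcftools query -l input.vcf.gz | \
    parallel -j 16 'bcftools view -Oz -s {} input.vcf.gz -o {}.vcf.gz'
A proper workflow manager such as Snakemake or Nextflow additionally gives you retries, resume and resource control.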
I could have parallelized, good idea. It won't solve the space issue but it will be faster. Thanks!
You could also take a look at Hail; however, it has quite a steep learning curve IMO.