Quickest way to filter vcf using bed files
1
0
Entering edit mode
7.0 years ago
spiral01 ▴ 110

I am filtering vcf files using bed files using vcftools. I have one bed file per vcf (split by chromosome):

for i in "${chroms[@]}"; do vcftools --gzvcf denisovan/chr"$i"_mq25_mapab100.vcf.gz --bed bed/chr"$i"_mask.bed --recode --keep-INFO-all --stdout | gzip -c > filtered/denisovan.filtered."$i".vcf.gz; done

This is proving extremely slow. Is there any tool that is much quicker for doing this?

SNP • 6.1k views
ADD COMMENT
1
Entering edit mode

note: don't use gzip , but bgzip.

ADD REPLY
0
Entering edit mode

Hi, what is the reasoning for using bgzip over gzip? Thanks.

ADD REPLY
1
Entering edit mode

bgzip allows random access and you can use it with tabix: it's faster to extract a random partion (a genomic interval) of your vcf.

e.g: https://software.broadinstitute.org/software/igv/VCF

VCF data files must be indexed for viewing in IGV, either by using igvtools or by using Tabix.

ADD REPLY
2
Entering edit mode
7.0 years ago

use gnu parallel :

 (seq 1 10 && echo X && echo Y) | parallel ' vcftools --gzvcf denisovan/chr{}_mq25_mapab100.vcf.gz (...) '
ADD COMMENT
0
Entering edit mode

That's extremely useful. Thank you.

ADD REPLY
0
Entering edit mode

I have been running the above code as such since yesterday:

(seq 1 22 && echo X) | parallel ' vcftools --gzvcf denisovan/chr{}_mq25_mapab100.vcf.gz --bed bed/chr{}_mask.bed --recode --keep-INFO-all --stdout | bgzip -i -I filtered/{}.tbi > denisovan.filtered.{}.vcf.bgz '

Vcftools is running on 4 files, but it is still extremely slow. Each vcf.gz file is 3-4gb in size, but having had vcftools running for almost 24 hours now, the four files still have not been completed. Is this normal? I am aware that I will be limited by processor speed but as this will take several days before it is completed I wanted to check that there is no way to optimise this process. Thanks.

ADD REPLY

Login before adding your answer.

Traffic: 2978 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6