speeding up bcftools view
23 months ago
eb13 ▴ 20

Hi all - I have a very large multi-sample VCF file that I am trying to subset by a list of sample IDs. However, my current approach is very slow (>2 hr per chromosome), and I am wondering whether there are any tricks to make it run faster on large files. Here is my current approach:

for file in /vcffiles/*.vcf.gz; do
    bcftools view -Oz -S sample_list.txt "$file" > /output/subset_"${file##*/}"
done

Thanks in advance for any suggestions!

vcf bcftools • 2.4k views

Thank you for your helpful responses!

23 months ago
barslmn ★ 2.3k

Another solution, using tsp (task spooler) to queue the jobs and run them in the background.

# Set the maximum number of jobs that run at once. Here we use the number of
# available processing units; you might want to change this to another number.
tsp -S "$(nproc)"

# The rest is similar: we just add tsp to the start of the command.
# tsp returns immediately, queues every job, and runs them in batches.
# Write the output with -o instead of >, because a redirect here would
# capture tsp's own output (the job ID) rather than the bcftools output.
for file in /vcffiles/*.vcf.gz; do
    tsp bcftools view -Oz -S sample_list.txt -o /output/subset_"${file##*/}" "$file"
done
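
If tsp is not installed, the same idea of running one bcftools process per file, a few at a time, can be sketched with find and xargs -P. This is only a rough, untested sketch and not part of the answer above: the function name subset_all, its arguments, and the default job count are hypothetical, and it assumes a find and xargs that support -print0, -0, and -P (GNU and BSD versions do).

```shell
#!/usr/bin/env bash
# Sketch: subset every *.vcf.gz in a directory, several files at a time.
# subset_all and its parameters are hypothetical names used for illustration.

subset_all() {
    local in_dir=$1    # directory holding the per-chromosome VCFs
    local out_dir=$2   # directory for the subset files
    local samples=$3   # file with one sample ID per line
    local jobs=${4:-4} # maximum number of bcftools processes at once

    mkdir -p "$out_dir"

    # find emits NUL-separated paths; xargs starts one sh per path,
    # keeping at most $jobs of them running at the same time.
    # bcftools writes its own output via -o, so no redirect is needed.
    find "$in_dir" -maxdepth 1 -name '*.vcf.gz' -print0 |
        xargs -0 -P "$jobs" -I{} sh -c '
            bcftools view -Oz -S "$1" -o "$2/subset_$(basename "$3")" "$3"
        ' sh "$samples" "$out_dir" {}
}

# Example call, using the paths from the question:
# subset_all /vcffiles /output sample_list.txt "$(nproc)"
```

Since each file is handled by a separate process, this mainly helps when there are many per-chromosome files and enough cores and disk bandwidth to serve them. bcftools view also accepts --threads to add compression threads within a single process, which may be worth trying on its own or in combination.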
23 months ago

Let's do it using Nextflow. I haven't tested it, so there may be some small bugs, but you get the idea.
