speeding up bcftools view
23 months ago
eb13 ▴ 20

Hi all - I have a very large multi-sample VCF file which I am trying to subset by a list of sample IDs. However, my current approach is very slow (>2 hr per chromosome), and I am wondering if there are any tricks to make it run faster with large files. Here is my current approach:

for file in /vcffiles/*.vcf.gz; do
    bcftools view -Oz -S sample_list.txt -o /output/subset_"${file##*/}" "$file"
done
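(Note the loop variable must be `file` throughout; the original wrote the output to `"${i##*/}"`, where `$i` was never set.) One thing worth trying before restructuring anything: bcftools' `--threads` option parallelizes the bgzf compression, and independent files can be run as capped background jobs. A minimal sketch, assuming indexed inputs and a multi-core machine; the thread and job counts (4) and the paths are placeholders:

```shell
#!/usr/bin/env bash
# Sketch: subset each VCF with extra compression threads, keeping at
# most MAX_JOBS files in flight at once. MAX_JOBS, the thread count,
# and the paths are assumptions - tune them for your machine.
MAX_JOBS=4

for file in /vcffiles/*.vcf.gz; do
    bcftools view -Oz --threads 4 -S sample_list.txt \
        -o /output/subset_"${file##*/}" "$file" &
    # Cap the number of concurrent background jobs.
    while [ "$(jobs -p | wc -l)" -ge "$MAX_JOBS" ]; do
        wait -n 2>/dev/null || wait
    done
done
wait
```

`--threads` mostly helps the compression step, so the gain per file is modest; running several chromosomes concurrently is usually where most of the wall-clock time is recovered.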

Thanks in advance for any suggestions!

vcf bcftools

Thank you for your helpful responses!

23 months ago
barslmn ★ 2.3k

Another solution uses tsp (Task Spooler) to queue the jobs.

# This sets the maximum number of simultaneous jobs. Here we use the
# number of processors; you might want to change this to another number.
tsp -S $(nproc)

# The rest is similar: we just add tsp to the start of the command.
# tsp enqueues each job immediately and runs them in batches of the
# size set above, so no trailing & is needed. Note that bcftools' -o
# option is used for the output: a shell redirection would capture
# tsp's own stdout (the job ID) rather than the compressed VCF.
for file in /vcffiles/*.vcf.gz; do
    tsp bcftools view -Oz -S sample_list.txt -o /output/subset_"${file##*/}" "$file"
done
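If GNU parallel is installed, it can play the same queueing role as tsp in a single command. A sketch mirroring the loop above (`{}` is the input file, `{/}` its basename; `-j` sets the number of concurrent jobs):

```shell
# Sketch using GNU parallel instead of tsp. Paths are placeholders.
parallel -j "$(nproc)" \
    bcftools view -Oz -S sample_list.txt -o /output/subset_{/} {} \
    ::: /vcffiles/*.vcf.gz
```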
23 months ago

Let's do it using Nextflow. I haven't tested this, so there may be some small bugs, but you get the idea.
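A minimal sketch of what such a pipeline might look like (untested, as the answer says; the process name, parameter names, and paths are all assumptions):

```nextflow
// Hypothetical sketch: one bcftools subsetting task per input VCF,
// scheduled in parallel by Nextflow's executor.
params.vcfs    = '/vcffiles/*.vcf.gz'
params.samples = 'sample_list.txt'

process SUBSET {
    input:
    path vcf
    path samples

    output:
    path "subset_${vcf.name}"

    script:
    """
    bcftools view -Oz -S ${samples} -o subset_${vcf.name} ${vcf}
    """
}

workflow {
    vcf_ch = Channel.fromPath(params.vcfs)
    SUBSET(vcf_ch, file(params.samples))
}
```

The advantage over a shell loop is that Nextflow handles job scheduling, resumption, and cluster/cloud executors for free.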
