Question

bcftools view -Oz -S [sample list] [input vcf] -o [output vcf] incredibly slow for vcf with many samples

5.1 years ago

curious ▴ 820

I have a tabix-indexed, gzipped VCF that contains about 40K samples (approx. 74 GB).

I just want to extract 6 samples from this VCF, which I have in a sample list.

I run a job on a cluster that is basically:

bcftools view -Oz -S [sample list] [input vcf] -o [output vcf]

5 hours later it is still running and the output file is only about 2000 KB. So I shut it down and am asking here. Why is this so slow? When I use bcftools stats, I can tell it is just very slowly adding more variants with each write. Should I be going about this a different way, or is this just the reality of working with a file this big?

I tried increasing the number of compression threads with --threads, but it is not obvious that this provides a speedup.

vcf bcftools • 1.9k views

score 0 · Answer 1 · 2019-11-08

5.1 years ago

Pierre Lindenbaum 164k

bcftools needs to parse every genotype; maybe that's slow for 40k samples.

try to run in parallel, one job per contig?

otherwise, try cut?
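A minimal sketch of the per-contig idea, assuming the input is tabix-indexed (as stated in the question) and that contig names contain no shell-special characters. The function name `per_contig_commands` and the output prefix are hypothetical; the function only prints one `bcftools view -r` command per contig, so the commands can be inspected before anything is run.

```shell
# Print one bcftools command per contig name read from stdin.
# $1: input VCF, $2: sample list, $3: output prefix (all illustrative names).
# Nothing is executed here; the commands are only written to stdout.
per_contig_commands() {
    input=$1
    samples=$2
    prefix=$3
    while IFS= read -r contig; do
        # Skip empty lines in the contig list.
        [ -n "$contig" ] || continue
        printf 'bcftools view -r %s -S %s -Oz -o %s.%s.vcf.gz %s\n' \
            "$contig" "$samples" "$prefix" "$contig" "$input"
    done
}
```

One possible way to use it: `tabix -l input.vcf.gz | per_contig_commands input.vcf.gz samples_list.txt subset > jobs.txt`, then run the lines of `jobs.txt` as separate cluster jobs (or locally with something like `xargs -P`), and finally join the pieces in contig order with `bcftools concat -Oz -o subset.vcf.gz` followed by the per-contig files. Whether this helps depends on how many cores and how much I/O bandwidth are available.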

first, get the column numbers for your samples (counted from the first column of the file):

bcftools view --header-only input.vcf.gz | grep "^#CHROM" | tr "\t" "\n" | cat -n | grep -w -F -f samples_list.txt

then, use cut:

gunzip -c input.vcf.gz | cut -f 1-9,<and-the-columns-indexes> | bgzip > out.vcf.gz

I'm not sure if it will be faster...
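The column lookup for the cut approach can also be scripted. The sketch below is one way to do it with awk, assuming the `#CHROM` header line has been saved to a file and the sample list has one name per line; the function name `cut_fields` and the file names are hypothetical. It prints a field list for `cut -f`: the 9 fixed VCF columns (CHROM through FORMAT) followed by the columns of the requested samples.

```shell
# Build a cut(1) field list: the 9 fixed VCF columns plus the requested samples.
# $1: file holding the #CHROM header line (tab-separated)
# $2: file with one sample name per line
cut_fields() {
    awk -F '\t' '
        # First file on the command line: remember each wanted sample name.
        NR == FNR { if ($0 != "") wanted[$0] = 1; next }
        # Second file: scan the sample columns (10 onward) of the header line.
        /^#CHROM/ {
            fields = "1-9"
            for (i = 10; i <= NF; i++)
                if ($i in wanted) fields = fields "," i
            print fields
            exit
        }
    ' "$2" "$1"
}
```

For example, after `bcftools view --header-only input.vcf.gz | grep '^#CHROM' > header.txt`, the extraction could be written as `gunzip -c input.vcf.gz | cut -f "$(cut_fields header.txt samples_list.txt)" | bgzip > out.vcf.gz`. Note that cut only drops columns: unlike `bcftools view -S`, it does not update INFO tags such as AC and AN, and sites where none of the kept samples carry the variant stay in the output.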

I will give that a shot, thank you so much.

5.1 years ago by curious ▴ 820