I have a tabix-indexed, gzipped VCF that contains about 40K samples (approx. 74 GB).
I just want to extract 6 samples from this VCF, which I have in a sample list.
I ran a job on a cluster that basically does:
bcftools view -Oz -S [sample list] [input vcf] -o [output vcf]
5 hours later it was still running and the output file was only about 2000 KB, so I shut it down to ask here. Why is this so slow? When I run bcftools stats on the output,
I can tell it is just very slowly adding more variants with each write. Should I be going about this a different way, or is this just the reality of working with a file this big?
I tried increasing the number of compression threads with --threads,
but it is not obvious that this provides a speedup.
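One alternative that may be worth testing, since the file is already tabix-indexed, is to subset one region at a time so each region can run as its own cluster job. The likely bottleneck is parsing 40K genotype columns per record rather than compressing the small output, which would explain why --threads does not visibly help. The sketch below only prints the commands (a dry run). The file names, the output directory, and the chromosome names are placeholders I made up for illustration, not taken from the job above; the real contig names can be listed with tabix -l.

```shell
#!/usr/bin/env bash
# Sketch: build one "bcftools view" command per region for a sample subset.
# All paths and region names are placeholder assumptions, not from the original job.
set -euo pipefail

IN_VCF="${IN_VCF:-input.vcf.gz}"      # hypothetical tabix-indexed input
SAMPLES="${SAMPLES:-samples.txt}"     # hypothetical file, one sample ID per line
OUT_DIR="${OUT_DIR:-subset_parts}"    # hypothetical directory for per-region output

# Print the bcftools command for a single region.
#   -r   use the index to jump straight to the region
#   -S   keep only the samples listed in the file
#   -Ob  write compressed BCF
subset_cmd() {
  local region="$1"
  printf 'bcftools view -r %s -S %s -Ob -o %s/%s.bcf %s\n' \
    "$region" "$SAMPLES" "$OUT_DIR" "$region" "$IN_VCF"
}

# Dry run: print one command per region given on the command line.
print_all_cmds() {
  local region
  for region in "$@"; do
    subset_cmd "$region"
  done
}
```

Calling print_all_cmds chr1 chr2 prints two commands, each of which could be submitted as a separate job; the per-region BCF files can then be joined in order with bcftools concat. Adding -I (--no-update) to the view command skips recalculating INFO/AC and INFO/AN for the subset, which may save some time if those fields are not needed, though I have not measured how much.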
I will give that a shot, thank you so much.