HPC settings for concatenating large VCF.GZ files
2.8 years ago
Katherine • 0

Dear all,

First-time poster here. :)

I was hoping to ask for your recommendations on what resources to request on an HPC cluster to concatenate ~22 chromosomes (each vcf.gz file is ~15GB, so ~330GB in total).

e.g. with bcftools concat -Oz chr1.vcf.gz chr2.vcf.gz ... chrX.vcf.gz > allchr.vcf.gz

Could I request 32 CPUs, each with 16GB of memory (512GB in total)? Would that work?

Any suggestions at all would be appreciated!!

Thanks in advance

Concatenate VCF HPC • 1.0k views

Katherine - There are many ways to approach this, none of which should require that much dedicated memory. Regardless, I really encourage you to tabix-index your VCF files. Once the index file is created (which only needs to be done once per file), operations that would have taken minutes will take seconds, including operations similar to what you propose. VAL
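
A minimal sketch of the indexing step, assuming the files are bgzip-compressed, are named chr*.vcf.gz, and that tabix (from HTSlib) is on the PATH. The helper name index_vcfs and the file naming are illustrative assumptions, not something from the original post:

```shell
# Hypothetical helper: index every chr*.vcf.gz in a directory that has no .tbi yet.
index_vcfs() {
    dir="${1:-.}"
    for vcf in "$dir"/chr*.vcf.gz; do
        [ -e "$vcf" ] || continue            # the glob matched nothing
        if [ ! -e "${vcf}.tbi" ]; then
            # tabix writes the index as ${vcf}.tbi next to the input file
            tabix -p vcf "$vcf" || return 1
        fi
    done
}

# Index the files in the current directory (done once; later runs skip indexed files).
index_vcfs .
```

Because files that already have a .tbi are skipped, the helper can be rerun safely after adding new chromosomes.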

2.8 years ago
4galaxy77 2.9k

bcftools concat holds only a small stream of data in memory at a time. You would therefore need at most about 1 GB of memory and one CPU core for this operation. You can, however, request multiple threads for multithreaded compression of the output.
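
As a rough sketch of what such a request could look like, assuming a SLURM scheduler (the thread does not name one) and inputs named chr1.vcf.gz to chr22.vcf.gz plus chrX.vcf.gz in the working directory. The thread count, memory margin, wall time, and the list file name vcf_list.txt are all placeholder assumptions:

```shell
#!/bin/bash
#SBATCH --cpus-per-task=4   # assumption: a few threads for output compression
#SBATCH --mem=2G            # assumption: small margin above the ~1 GB estimate
#SBATCH --time=24:00:00     # assumption: placeholder wall time

# Build the input list in chromosome order (file names are assumptions).
for chr in $(seq 1 22) X; do
    echo "chr${chr}.vcf.gz"
done > vcf_list.txt

# Run only when bcftools and the inputs are actually present.
if command -v bcftools >/dev/null 2>&1 && [ -e chr1.vcf.gz ]; then
    # --threads adds compression threads; -f reads the inputs from the list file.
    bcftools concat --threads "${SLURM_CPUS_PER_TASK:-4}" \
        -f vcf_list.txt -Oz -o allchr.vcf.gz
    # Possible alternative when all files share the same header:
    # bcftools concat --naive, which concatenates without recompressing.
fi
```

Writing the file list explicitly keeps the chromosomes in numeric order, which a shell glob such as chr*.vcf.gz would not (chr10 would sort before chr2).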


Thanks for this information, @4galaxy77! How many hours would you expect a bcftools job to take to concatenate just TWO vcf.gz files (~15GB each)? At the moment, this job has been running for 20 hours. Thanks in advance, Katherine.
