I have a frankly ludicrous number of single-sample vcf.gz files (with their tabix indices) that I want to merge into one big file. I've previously used bcftools merge on 48 threads to merge 1,000 samples and it took 15+ minutes. I'm pretty sure the time to complete won't scale linearly once I increase the number of samples to 500k+. Any suggestions? Should I merge groups of samples at a time, going up a tree (roughly like the sketch below)? Should I use a different tool?
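In case it helps frame the question, this is roughly the batched/tree merge I had in mind. It's just a sketch, not something I've run at 500k scale; the batch size, thread count, directory names, and output paths are placeholders I'd tune, and it assumes bcftools is on PATH and every input already has a .tbi.

```python
"""Rough sketch of a batched, tree-style merge driven by bcftools.

Assumptions (mine, not from any tool's docs beyond standard flags):
bcftools on PATH, all inputs bgzipped and tabix-indexed, and enough
scratch disk for the intermediate per-batch merges.
"""
import glob
import os
import subprocess

THREADS = 48          # threads passed to each bcftools merge call
BATCH_SIZE = 1000     # inputs merged per call; tune for your I/O and RAM
WORKDIR = "merge_tmp" # scratch space for intermediate batch outputs


def merge_batch(vcfs, out_path):
    """Merge one batch of indexed VCFs into a bgzipped VCF plus .tbi."""
    list_path = out_path + ".list"
    with open(list_path, "w") as fh:
        fh.write("\n".join(vcfs) + "\n")
    subprocess.run(
        ["bcftools", "merge", "--threads", str(THREADS),
         "-l", list_path, "-Oz", "-o", out_path],
        check=True,
    )
    # index the batch output so the next round can merge it
    subprocess.run(["bcftools", "index", "-t", out_path], check=True)
    return out_path


def tree_merge(vcfs, final_out="merged.vcf.gz"):
    """Repeatedly merge in batches of BATCH_SIZE until one file remains."""
    os.makedirs(WORKDIR, exist_ok=True)
    level = 0
    while len(vcfs) > 1:
        next_level = []
        for i in range(0, len(vcfs), BATCH_SIZE):
            batch = vcfs[i:i + BATCH_SIZE]
            if len(batch) == 1:
                # bcftools merge needs >= 2 inputs; carry singletons forward
                next_level.append(batch[0])
                continue
            out = os.path.join(WORKDIR, f"level{level}_batch{i // BATCH_SIZE}.vcf.gz")
            next_level.append(merge_batch(batch, out))
        vcfs = next_level
        level += 1
    last = vcfs[0]
    if os.path.dirname(last) == WORKDIR:
        os.replace(last, final_out)
        if os.path.exists(last + ".tbi"):
            os.replace(last + ".tbi", final_out + ".tbi")
        return final_out
    return last


if __name__ == "__main__":
    inputs = sorted(glob.glob("single_sample_vcfs/*.vcf.gz"))
    print(tree_merge(inputs))
```

The appeal of doing it this way is that each round's batches are independent, so they could be farmed out across nodes (e.g. one batch per cluster job) instead of running serially as in this sketch; the downside is the intermediate files can eat a lot of disk.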
How to merge 20K single-sample VCFs *without* using plink or plink2?
Just saw there's a 'virtual codeathon' on "scaling VCF to millions of samples" coming up; you can sign up here: https://ncbiinsights.ncbi.nlm.nih.gov/event/vcf-for-population-genomics-codeathon/