I have up to 10,000 single-sample imputed VCF files with exactly the same variant set and order; each VCF contains ~38 million variants. I want to load them into PLINK and wonder what the most efficient way to do this is.
Currently I exploit the fact that the variants are in the same order and, via shell cut-and-paste, combine the common variant metadata and INFO columns with the genotype column of each VCF into a single large VCF (it weighs several TB). This does the job but requires a lot of memory, and I wonder whether something can be done at the binary level instead. I believe that merging via PLINK 1.9 or bcftools doesn't take into account that the VCFs contain the same variants, and thus must be slow, but maybe I'm wrong. Can either be more efficient than what I am doing now? PLINK 2, even the development build, currently has no way to merge different samples, as far as I understand.
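For concreteness, here is a toy sketch of that column-wise paste. The two tiny VCFs, sample names, and paths are fabricated for illustration; real files would be bgzipped and far larger:

```shell
#!/bin/sh
# Toy illustration of the shell cut-and-paste merge: two fabricated
# single-sample VCFs with identical variant lines, merged by pasting
# column 10 (sample name + genotypes) of the second onto the first.
set -e
dir=$(mktemp -d)

{
  printf '##fileformat=VCFv4.2\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\n'
  printf '1\t100\trs1\tA\tG\t.\tPASS\t.\tGT\t0/1\n'
  printf '1\t200\trs2\tC\tT\t.\tPASS\t.\tGT\t1/1\n'
} > "$dir/s1.vcf"
{
  printf '##fileformat=VCFv4.2\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS2\n'
  printf '1\t100\trs1\tA\tG\t.\tPASS\t.\tGT\t0/0\n'
  printf '1\t200\trs2\tC\tT\t.\tPASS\t.\tGT\t0/1\n'
} > "$dir/s2.vcf"

# Keep the first file (minus its ## meta lines) whole; cut out only
# column 10 of every other file; paste the columns side by side.
# The #CHROM header line is kept, so sample names merge automatically.
grep -v '^##' "$dir/s1.vcf" > "$dir/body.txt"
grep -v '^##' "$dir/s2.vcf" | cut -f 10 > "$dir/col_s2.txt"
paste "$dir/body.txt" "$dir/col_s2.txt" > "$dir/merged_body.txt"

# Re-attach the ## meta-information lines from the first file.
grep '^##' "$dir/s1.vcf" | cat - "$dir/merged_body.txt" > merged.vcf
cat merged.vcf
```

One practical caveat: `paste` holds one open file descriptor per input, so with ~10,000 files you may hit the per-process fd limit and would need to paste in batches.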
P.S. Since I have a multi-node cluster, performing many lightweight operations (like converting a single-sample VCF to BCF) is not a problem, but performing a single heavy operation is.
Thanks a lot, I didn't know about the sample-major .bed format. I will try it; hopefully it will do the trick!
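For anyone else reading: the reason the sample-major (individual-major) .bed layout helps here is that each sample's genotypes form one contiguous run of ceil(V/4) bytes (four 2-bit genotype codes per byte), so per-sample rows can be generated independently on different nodes and merged with a plain `cat`. A toy sketch with fabricated byte patterns (V = 4 variants, so each row is exactly one byte):

```shell
#!/bin/sh
# Sample-major .bed sketch: magic bytes 0x6c 0x1b, then a mode byte
# (0x00 = sample-major, 0x01 = the usual variant-major), then one
# ceil(V/4)-byte genotype row per sample. Rows are fabricated here;
# in the real pipeline each node would emit one row per sample.
set -e
printf '\154\033\000' > merged.bed       # 0x6c 0x1b 0x00 header

printf '\252' > row_s1.bin               # 0xAA: four "10" codes (all het)
printf '\377' > row_s2.bin               # 0xFF: four "11" codes (all hom A2)
cat row_s1.bin row_s2.bin >> merged.bed  # merging samples = concatenation
```

If I recall correctly, PLINK 1.9 accepts sample-major .bed input and rewrites it to the usual variant-major layout via `--make-bed`, so the heavy transpose happens once inside PLINK; worth double-checking against the .bed format docs.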
Worked perfectly, thanks!