I have up to 10,000 single-sample imputed VCF files with exactly the same variant set and order; each VCF contains ~38 million variants. I want to load them into PLINK and wonder what the most efficient way to do this is.
Currently I exploit the fact that the variants are in the same order and, via shell cut-and-paste, combine the common variant metadata and INFO columns with the genotype column of each VCF into a single large VCF (it weighs several TB). This does the job but requires a lot of memory, and I wonder whether something can be done at the binary level instead. I believe that merging via PLINK 1.9 or bcftools doesn't take into account that the VCFs contain the same variants, and thus must be slow, but maybe I'm wrong. Can either be more efficient than what I am doing now? PLINK 2, even the development build, currently has no way to merge different samples, as far as I understand.
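For concreteness, here is a toy sketch of that column-wise paste. The two tiny VCFs, sample names, and paths are fabricated for illustration; real files would be bgzipped and far larger:

```shell
#!/bin/sh
# Toy illustration of the shell cut-and-paste merge: two fabricated
# single-sample VCFs with identical variant lines, merged by pasting
# column 10 (sample name + genotypes) of the second onto the first.
set -e
dir=$(mktemp -d)

{
  printf '##fileformat=VCFv4.2\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\n'
  printf '1\t100\trs1\tA\tG\t.\tPASS\t.\tGT\t0/1\n'
  printf '1\t200\trs2\tC\tT\t.\tPASS\t.\tGT\t1/1\n'
} > "$dir/s1.vcf"
{
  printf '##fileformat=VCFv4.2\n'
  printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS2\n'
  printf '1\t100\trs1\tA\tG\t.\tPASS\t.\tGT\t0/0\n'
  printf '1\t200\trs2\tC\tT\t.\tPASS\t.\tGT\t0/1\n'
} > "$dir/s2.vcf"

# Keep the first file (minus its ## meta lines) whole; cut out only
# column 10 of every other file; paste the columns side by side.
# The #CHROM header line is kept, so sample names merge automatically.
grep -v '^##' "$dir/s1.vcf" > "$dir/body.txt"
grep -v '^##' "$dir/s2.vcf" | cut -f 10 > "$dir/col_s2.txt"
paste "$dir/body.txt" "$dir/col_s2.txt" > "$dir/merged_body.txt"

# Re-attach the ## meta-information lines from the first file.
grep '^##' "$dir/s1.vcf" | cat - "$dir/merged_body.txt" > merged.vcf
cat merged.vcf
```

One practical caveat: `paste` holds one open file descriptor per input, so with ~10,000 files you may hit the per-process fd limit and would need to paste in batches.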
P.S. Since I have a multi-node cluster, performing many lightweight operations (like converting a single-sample VCF to BCF) is not a problem, but performing a single heavy operation is.
Thanks a lot, I didn't know about the sample-major .bed format. I will try it; hopefully it will do the trick!
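For anyone else reading: the reason the sample-major (individual-major) .bed layout helps here is that each sample's genotypes form one contiguous run of ceil(V/4) bytes (four 2-bit genotype codes per byte), so per-sample rows can be generated independently on different nodes and merged with a plain `cat`. A toy sketch with fabricated byte patterns (V = 4 variants, so each row is exactly one byte):

```shell
#!/bin/sh
# Sample-major .bed sketch: magic bytes 0x6c 0x1b, then a mode byte
# (0x00 = sample-major, 0x01 = the usual variant-major), then one
# ceil(V/4)-byte genotype row per sample. Rows are fabricated here;
# in the real pipeline each node would emit one row per sample.
set -e
printf '\154\033\000' > merged.bed       # 0x6c 0x1b 0x00 header

printf '\252' > row_s1.bin               # 0xAA: four "10" codes (all het)
printf '\377' > row_s2.bin               # 0xFF: four "11" codes (all hom A2)
cat row_s1.bin row_s2.bin >> merged.bed  # merging samples = concatenation
```

If I recall correctly, PLINK 1.9 accepts sample-major .bed input and rewrites it to the usual variant-major layout via `--make-bed`, so the heavy transpose happens once inside PLINK; worth double-checking against the .bed format docs.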
Worked perfectly, thanks!