Question

How to efficiently/quickly merge ~500k vcfs?

1

23 months ago

Lei ▴ 20

I have a frankly ludicrous number of single-sample vcf.gz files (each with its tabix index) that I want to merge into one big file. I've previously used bcftools merge with 48 threads to merge 1,000 of them, and it took 15+ minutes. I'm pretty sure the time to complete won't scale linearly once I increase the number of samples to 500k+. Any suggestions? Should I merge groups of samples at a time, like going up a tree? Should I use a different tool?

bcftools vcf variant-calling • 1.9k views

ADD COMMENT • link updated 23 months ago by Jeremy Leipzig 23k • written 23 months ago by Lei ▴ 20
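
The tree-style merge raised in the question can be sketched as a small planner. This is only an illustration, not a benchmarked recipe: the function names, the output naming scheme, and the batch size of 1000 (borrowed from the 1000-sample run described above) are all assumptions. It only builds the plan and the `bcftools merge` / `bcftools index` command strings; running them (for example through a job scheduler) is left out.

```python
import shlex


def plan_merge_rounds(vcfs, batch_size=1000, prefix="merge"):
    """Plan a tree-style merge.

    Returns a list of rounds; each round is a list of (output, inputs) jobs.
    Jobs within one round are independent and could run in parallel.
    All names here are hypothetical, chosen for illustration.
    """
    if batch_size < 2:
        raise ValueError("batch_size must be at least 2")
    rounds = []
    current = list(vcfs)
    round_no = 0
    while len(current) > 1:
        round_no += 1
        jobs, next_inputs = [], []
        for start in range(0, len(current), batch_size):
            batch = current[start:start + batch_size]
            if len(batch) == 1:
                # A lone leftover file is carried into the next round unmerged.
                next_inputs.append(batch[0])
                continue
            out = f"{prefix}.r{round_no}.b{start // batch_size}.vcf.gz"
            jobs.append((out, batch))
            next_inputs.append(out)
        rounds.append(jobs)
        current = next_inputs
    return rounds


def job_commands(out, inputs, threads=4):
    """Return (list_file, commands) for one merge job.

    The caller is expected to write `inputs`, one path per line, to list_file
    before running the commands.
    """
    list_file = out + ".list"
    commands = [
        f"bcftools merge --file-list {shlex.quote(list_file)} "
        f"--threads {threads} -Oz -o {shlex.quote(out)}",
        # Each intermediate output needs an index before it can be merged again.
        f"bcftools index --tbi {shlex.quote(out)}",
    ]
    return list_file, commands
```

With 500,000 inputs and a batch size of 1000, this plans 500 first-round jobs followed by a single second-round job over the 500 intermediates. Keeping batches modest also helps stay under the per-process open-file limit, which a single merge over hundreds of thousands of files would likely exceed.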

2

How to merge 20K single-sample VCFs *without* using plink or plink2?

ADD REPLY • link 23 months ago by Pierre Lindenbaum 166k

1

I just saw there is a 'virtual codeathon' on "scaling vcf to millions of samples" coming up soon; you can sign up here: https://ncbiinsights.ncbi.nlm.nih.gov/event/vcf-for-population-genomics-codeathon/

ADD REPLY • link 23 months ago by cmdcolin ★ 4.2k

score 0 · Answer 1 · 2023-05-08

0

23 months ago

Jeremy Leipzig 23k

I would suggest TileDB-VCF, which enables downstream analysis (and export) without the need to fire up a Spark cluster. (Disclaimer: I work for TileDB)

ADD COMMENT • link 23 months ago by Jeremy Leipzig 23k

1

Interesting! I looked into TileDB-VCF a couple of months back, and it looks like the tutorial has improved a lot since then! I'll give it a try as well.

ADD REPLY • link 23 months ago by Lei ▴ 20

0

Feel free to reach out to me directly. I can walk you through some notebooks and/or provide some free credits to get you started.

ADD REPLY • link 23 months ago by Jeremy Leipzig 23k

score 0 · Answer 2 · 2023-05-09

0

23 months ago

DBScan ▴ 470

Do you have VCFs or gVCFs? For gVCFs you could also use Hail (https://hail.is/) or GLnexus (https://github.com/dnanexus-rnd/GLnexus).

ADD COMMENT • link 23 months ago by DBScan ▴ 470
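
If it is unclear whether the files are plain VCFs or gVCFs, a rough check is to look for the symbolic ALT alleles that gVCF reference blocks typically carry. The sketch below is a heuristic only: it assumes GATK-style `<NON_REF>` or the `<*>` allele used by bcftools and some other callers, and the function name and record limit are made up for illustration.

```python
import gzip

# Symbolic ALT alleles commonly seen in gVCF reference blocks (assumption:
# the files come from a caller that uses one of these two conventions).
GVCF_ALT_MARKERS = ("<NON_REF>", "<*>")


def looks_like_gvcf(path, max_records=10_000):
    """Heuristically decide whether a VCF looks like a gVCF.

    Scans up to max_records data lines and returns True as soon as an ALT
    field contains one of the gVCF marker alleles.
    """
    # bgzip output is valid multi-member gzip, so the gzip module can read it.
    opener = gzip.open if path.endswith(".gz") else open
    seen = 0
    with opener(path, "rt") as handle:
        for line in handle:
            if line.startswith("#"):
                continue  # skip meta-information and header lines
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 5:
                continue  # malformed or empty line
            if any(alt in GVCF_ALT_MARKERS for alt in fields[4].split(",")):
                return True
            seen += 1
            if seen >= max_records:
                break
    return False
```

A negative result only means no marker allele was found in the records scanned, so checking the caller's command line or the file header is still the more reliable route.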