How to merge many huge gVCFs quickly?
3.5 years ago
ymat • 0

Hello,

To perform a population genomic analysis, I am trying to merge many huge variant files (gVCFs): over 20 files, on the order of several dozen GB.

I have used bcftools merge and vcf-merge so far, but both are very slow at merging these files into a single file.

Do you have any ideas for merging huge gVCF files? I want to keep as many variants as possible, so I used gVCF rather than VCF.

Thank you!

gvcf vcf merge • 5.1k views

Merging files that size as VCFs is always going to be slow. If you don't mind losing some metadata and have a lot of memory at your disposal then converting to plink binary and merging in that format will speed things up a lot.
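The PLINK route could look roughly like the sketch below. It is only an illustration under several assumptions: PLINK 1.9 is installed as `plink`, the inputs are bgzipped files ending in `.vcf.gz` or `.g.vcf.gz`, and the function names and the `merged` output prefix are made up here. gVCF reference blocks have no direct equivalent in PLINK's binary format, so expect that information to be lost. The functions only print the commands so they can be reviewed before running.

```shell
#!/bin/sh
# Print a `plink --vcf ... --make-bed` command for each input file (dry run).
# Arguments: paths to bgzipped VCF/gVCF files.
# The output prefix is the file name without directory and
# without the .vcf.gz or .g.vcf.gz suffix.
print_convert_cmds() (
    for vcf in "$@"; do
        prefix=$(basename "$vcf" .vcf.gz)
        prefix=${prefix%.g}
        printf 'plink --vcf %s --make-bed --out %s\n' "$vcf" "$prefix"
    done
)

# Print the merge command (dry run).
# $1: prefix of the first fileset, used as the reference (--bfile)
# $2: text file listing the remaining fileset prefixes, one per line
print_merge_cmd() (
    printf 'plink --bfile %s --merge-list %s --make-bed --out merged\n' "$1" "$2"
)
```

For example, `print_convert_cmds data/*.g.vcf.gz | sh` would run the conversions, and `print_merge_cmd s1 rest.txt | sh` would merge them, where `rest.txt` is a hypothetical list of the other prefixes.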


I also think converting to plink binary and merging is a good solution! I will try it. Thank you.

3.5 years ago

Merge per region in parallel, and then concatenate the chunks.
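A minimal sketch of this approach with bcftools follows. It assumes every gVCF is bgzip-compressed and indexed, that a file such as `gvcfs.list` (a hypothetical name) holds one input path per line, and that the regions are given in genomic order. The function names and output file names are invented for illustration. Nothing is executed: the functions only print commands, which can then be piped to `sh` or to a job runner such as GNU parallel.

```shell
#!/bin/sh
# Print one `bcftools merge` command per region (dry run).
# $1: file listing the bgzipped, indexed gVCFs, one path per line
# $2: whitespace-separated regions, e.g. "chr1 chr2" or "chr1:1-5000000"
print_merge_cmds() (
    list=$1
    for region in $2; do
        # Make the region safe for a file name (chr1:1-500 -> chr1_1-500).
        tag=$(printf '%s' "$region" | tr ':' '_')
        printf 'bcftools merge -l %s -r %s -O z -o merged.%s.vcf.gz\n' \
            "$list" "$region" "$tag"
    done
)

# Print the command that stitches the per-region files back together.
# $1: the same regions, in genomic order, because `bcftools concat`
#     expects its inputs in that order.
print_concat_cmd() (
    files=""
    for region in $1; do
        tag=$(printf '%s' "$region" | tr ':' '_')
        files="$files merged.$tag.vcf.gz"
    done
    printf 'bcftools concat -O z -o merged.all.vcf.gz%s\n' "$files"
)
```

For example, `print_merge_cmds gvcfs.list "chr1 chr2" | parallel -j 2` would run two merges at once, followed by `print_concat_cmd "chr1 chr2" | sh`. For gVCF input it may also be worth checking the `-g/--gvcf` option of `bcftools merge` in the manual.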


That is also one of the best solutions! Thank you so much.

3.2 years ago
ashotmarg2004 ▴ 130

I have a similar issue and came across GLnexus: https://github.com/dnanexus-rnd/GLnexus I haven't used it myself yet, but according to the authors' paper it seems to be faster than GATK.

3.2 years ago
William ★ 5.3k

Use GATK4 GenomicsDBImport and GenotypeGVCFs in parallel for many callable regions.

You can calculate the callable regions with Picard, by splitting the reference genome at stretches of 99+ N nucleotides.

picard ScatterIntervalsByNs R=$fasta O=callable_regions.bed N=99 -Xmx1G OT=ACGT

Then loop over the regions in that file and call GenomicsDBImport and GenotypeGVCFs for each region (line) in the bed file.
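The loop over regions could be sketched as below. Several things here are assumptions rather than part of the answer: the intervals file is read as Picard interval_list format (`@` header lines, then tab-separated contig, start, end with 1-based coordinates), so a true BED file would need its start shifted by one; the gVCFs are supplied through a `--sample-name-map` file; and the function name, workspace names, and output names are made up. The function prints one command line per region instead of running anything, so each line can be submitted as a separate cluster job.

```shell
#!/bin/sh
# Print GenomicsDBImport + GenotypeGVCFs commands for every interval (dry run).
# $1: intervals file (assumed Picard interval_list: '@' headers, then
#     tab-separated contig, start, end; 1-based coordinates)
# $2: reference FASTA
# $3: sample map for --sample-name-map (sample name, TAB, gVCF path per line)
print_gatk_cmds() (
    awk -v ref="$2" -v map="$3" '
        BEGIN { FS = "\t" }
        # Skip header lines and anything without contig/start/end.
        !/^@/ && NF >= 3 {
            region = $1 ":" $2 "-" $3
            tag = $1 "_" $2 "_" $3
            printf "gatk GenomicsDBImport --sample-name-map %s", map
            printf " --genomicsdb-workspace-path db_%s -L %s", tag, region
            printf " && gatk GenotypeGVCFs -R %s -V gendb://db_%s", ref, tag
            printf " -L %s -O calls.%s.vcf.gz\n", region, tag
        }
    ' "$1"
)
```

For example, `print_gatk_cmds callable_regions.bed ref.fa samples.map > jobs.txt` writes one job per line, which a scheduler or GNU parallel can then execute.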

This works best, of course, if you can submit many GenomicsDBImport and GenotypeGVCFs commands to a somewhat large cluster.

Finally, use bcftools concat (naive) to get a single VCF/BCF file. For BCF output, first convert each per-region VCF file to BCF.
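The final step might look like the following sketch, which assumes the per-region outputs are bgzipped VCFs ending in `.vcf.gz`, are passed in genomic order, and share the same header (a requirement of the naive mode). The function name and the `all_regions.bcf` output name are hypothetical. As a dry run, it prints the conversion commands followed by the concat command.

```shell
#!/bin/sh
# Print commands that convert per-region VCFs to BCF and then
# concatenate the BCFs without recompression (dry run).
# Arguments: per-region .vcf.gz files, already in genomic order.
print_bcf_concat_cmds() (
    bcfs=""
    for vcf in "$@"; do
        bcf=${vcf%.vcf.gz}.bcf
        printf 'bcftools view -O b -o %s %s\n' "$bcf" "$vcf"
        bcfs="$bcfs $bcf"
    done
    # --naive copies the compressed blocks as-is, so all inputs must be
    # of the same type and have identical headers.
    printf 'bcftools concat --naive -o all_regions.bcf%s\n' "$bcfs"
)
```

For example, `print_bcf_concat_cmds calls.chr1_1_1000.vcf.gz calls.chr1_1200_5000.vcf.gz | sh` would run the conversions and then the concatenation in order.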
