Fastest way to merge 2 VCFs and get a BCF
3.5 years ago
boxate1618 ▴ 60

I have 2 VCFs with millions of records and thousands of samples, and I need to merge them and get a BCF as output. It seems that converting to BCF first and then merging is significantly faster (I can do the conversions in parallel first), and runs in about 75 seconds in my test. Trying to output BCF during the merge takes about 125 seconds. Is there anything else that could speed this up?

# try merging while converting to bcf simultaneously
#real    2m5.219s
#user    5m42.089s
#sys     0m3.121s
time bcftools merge --threads 24 $vcf_path1 $vcf_path2 -Ob > $convert_during_path

# convert each to bcf first
#real    0m46.101s
#user    2m33.987s
#sys     0m1.645s
time bcftools view --threads 24 $vcf_path1 -Ob > $bcf_path1

#real    0m44.881s
#user    2m32.189s
#sys     0m1.533s
time bcftools view --threads 24 $vcf_path2 -Ob > $bcf_path2

# merge bcfs
#real    0m29.010s
#user    3m47.727s
#sys     0m1.569s
time bcftools merge --threads 24 $bcf_path1 $bcf_path2 -Ob > $convert_before_path
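
For reference, the two conversions above are what I run in parallel; splitting the 24 threads between the two background jobs is just a guess on my part, not something I benchmarked separately:

# run both conversions at the same time, then wait for both to finish
bcftools view --threads 12 $vcf_path1 -Ob > $bcf_path1 &
bcftools view --threads 12 $vcf_path2 -Ob > $bcf_path2 &
wait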
bcftools

The time difference is negligible, no? Are these reduced VCFs / BCFs as part of a trial run for the ultimate merge?

It makes sense that it is quicker via BCF.

Trying to output bcf during the merge

Why would you do that?


Are these reduced VCFs / BCFs

Yes, these "test" VCFs/BCFs have several thousand records and samples, to get an idea of the benchmarking. The real ones have millions of records and tens of thousands of samples.

Why would you do that?

I have to do 2 successive merges and then filter across all sites. My understanding of BCF is that operations across all sites will be 10-20x faster than with VCF.
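
Roughly what I have in mind, as a sketch (the chunk file names and the filter expression are just placeholders, not my actual paths or criteria):

# first merge, kept as BCF and indexed so it can be used in the next merge
bcftools merge --threads 24 chunk1.bcf chunk2.bcf -Ob -o merge1.bcf
bcftools index merge1.bcf

# second merge
bcftools merge --threads 24 merge1.bcf chunk3.bcf -Ob -o merged_all.bcf

# filter across all sites (placeholder expression)
bcftools view -i 'QUAL>20' --threads 24 merged_all.bcf -Ob -o merged_all.filtered.bcf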


Ah yes, makes sense now! I would definitely use BCF for any large operation, and ensure that both files are normalised:

bcftools norm -m-any \
  --check-ref w \
  -f hg38.fasta \
  -Ob var.bcf > var.norm.bcf ;
bcftools index var.norm.bcf ;

Other than that, I would just start with the merge and ensure that you have considerable memory available...


I chunked a big cohort "by sample" and ran genotype imputation on the chunks; now I need to merge them and filter by imputation quality. So the chunks should already be consistent with strand and have the same order of records.

After some googling:

I might be able to save some time by piping uncompressed BCF between steps, or by writing less-compressed BCF at the steps where I do have to write to disk. In addition, it seems the newest version of bcftools (1.12) lets you merge from pipes using the --no-index option. I would imagine this is a little more dangerous, though; not sure if I am going to try that one. I have > 1 TB RAM, so hopefully I can play with some of these. Even a 2x speed-up is days for me.

http://www.htslib.org/doc/bcftools.html#merge
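
As a rough, untested sketch of what I mean (INFO/DR2 is just a stand-in for whichever imputation-quality tag my imputer wrote, and feeding merge --no-index from process substitutions is an assumption on my part based on the docs above):

# pipe uncompressed BCF (-Ou) into the filter step instead of writing an intermediate file
bcftools merge --threads 24 $bcf_path1 $bcf_path2 -Ou \
  | bcftools view -i 'INFO/DR2>0.3' -Ob -o merged.filtered.bcf -

# bcftools >= 1.12: merge unindexed streams with --no-index (same record order required)
bcftools merge --no-index \
  <(bcftools view -Ou $vcf_path1) \
  <(bcftools view -Ou $vcf_path2) \
  -Ob -o merged.bcf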


Wait, you did the imputation as 1 sample against the reference dataset, and then repeated that across all samples separately? Is that the correct procedure? Our chunks would normally be chromosomal regions. For example, I did 2 large imputations last year in 5-megabase chunks across the genome. One can instruct the algorithms to impute a certain number of bp beyond each chunk so that they overlap.

Regarding the actual speed testing: bcftools is already 'well-refined' and, ironically, if users are waiting on the results, it may be better to just start the process.
