3.6 years ago
boxate1618
I have 2 VCFs with millions of records and thousands of samples, and I need to merge them and get a BCF as output. It seems converting to BCF first and then merging is significantly faster (I can do the conversions in parallel first): it runs in about 75 seconds in my test, while outputting BCF during the merge takes about 125 seconds. Is there anything else that could speed this up?
# try merging while converting to bcf simultaneously
#real 2m5.219s
#user 5m42.089s
#sys 0m3.121s
time bcftools merge --threads 24 $vcf_path1 $vcf_path2 -Ob > $convert_during_path
# convert each to bcf first
#real 0m46.101s
#user 2m33.987s
#sys 0m1.645s
time bcftools view --threads 24 $vcf_path1 -Ob > $bcf_path1
#real 0m44.881s
#user 2m32.189s
#sys 0m1.533s
time bcftools view --threads 24 $vcf_path2 -Ob > $bcf_path2
# merge bcfs
#real 0m29.010s
#user 3m47.727s
#sys 0m1.569s
time bcftools merge --threads 24 $bcf_path1 $bcf_path2 -Ob > $convert_before_path
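Note that bcftools merge normally expects its inputs to be indexed (unless using the newer --no-index option), so if the intermediate BCFs are not already indexed, the convert-first route also needs an indexing step along these lines (using the same paths as in the benchmark above):
# index the intermediate BCFs so merge can use them
bcftools index $bcf_path1
bcftools index $bcf_path2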
The time difference is negligible, no? Are these reduced VCFs / BCFs part of a trial run for the ultimate merge?
It makes sense that it is quicker via BCF.
Why would you do that?
Yes, these "test" VCFs/BCFs have several thousand records and samples, just to get an idea for benchmarking. The real ones have millions of records and tens of thousands of samples.
I have to do 2 successive merges and then filter across all sites. My understanding of BCF is that operations across all sites will be 10-20x faster than on VCF.
Ah yes, makes sense now! I would definitely use BCF for any large operation, and ensure that both files are normalised, for example:
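A typical normalisation step looks roughly like the following (a sketch only: reference.fa, input.bcf and normalised.bcf are placeholder names, adjust to your data):
# split multiallelic sites and left-align indels against the reference
bcftools norm -m -any -f reference.fa --threads 24 -Ob -o normalised.bcf input.bcf
bcftools index normalised.bcf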
Other than that, I would just start with the merge and ensure that you have considerable memory available...
I chunked a big cohort "by sample" and ran genotype imputation on the chunks; now I need to merge and filter by imputation quality. So the chunks should already be consistent with strand and have the same order of records.
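For the quality filter after the merge, something like this might work, assuming the imputation tool wrote a per-site INFO field such as R2 (the field name, the 0.3 cutoff, and the file names are placeholders; check what your imputation software actually emits):
# keep only sites above the imputation-quality threshold, reading and writing BCF
bcftools view -i 'INFO/R2>0.3' --threads 24 -Ob -o filtered.bcf merged.bcf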
After some googling: I might be able to save some time by piping uncompressed BCF between steps, or by writing less-compressed BCF at the steps where I do have to write. In addition, it seems the newest version of bcftools (1.12) lets you merge between pipes using the --no-index option. I would imagine that is a little more dangerous, though, so I am not sure if I am going to try that one. I have > 1 TB RAM, so hopefully I can play with some of these; even a 2x speed-up is days for me. http://www.htslib.org/doc/bcftools.html#merge
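A rough sketch of what that pipe-based merge could look like (assuming bcftools >= 1.12; chunk1.vcf.gz and chunk2.vcf.gz are placeholder names, and both inputs must share the same chromosomal order since --no-index skips the index lookup):
# convert each chunk to uncompressed BCF (-Ou) on the fly and stream both
# straight into merge via process substitution, compressing only at the final write
bcftools merge --no-index --threads 24 -Ob -o merged.bcf \
    <(bcftools view -Ou chunk1.vcf.gz) \
    <(bcftools view -Ou chunk2.vcf.gz)
This avoids writing the intermediate BCFs to disk at all, at the cost of losing the safety checks that come with indexed inputs.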
Wait, you did the imputation as 1 sample against the reference dataset, and then repeated that across all samples separately? Is that the correct procedure? Our chunks would normally be chromosomal regions. For example, I did 2 large imputations last year in 5 megabase chunks across the genome. One can instruct the algorithms to impute a certain amount of bp across each chunk so that they overlap.
Regarding the actual speed efficiency testing, BCFtools is already 'well-refined' and, ironically, if users are awaiting the results, it may be better to just commence the process.