With a task like this, I think one will have to pay the piper in some way. Creating thousands of single-sample files is inherently I/O intensive.
BTW, because I was only interested in het sites for my analysis, I was able to use over 40,000 simultaneous open files in vcf-split without a problem on my workstation. If I were keeping every site I'd want to limit it to under 10k.
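The ceiling on simultaneous output files is the per-process file-descriptor limit, which is worth checking before deciding how many samples to split in one pass. A minimal shell check (the 45000 figure in the comment is just an illustrative value, not a recommendation):

```shell
# Each simultaneously open single-sample output file consumes one file
# descriptor, so inspect the per-process soft limit first.
ulimit -n
# To raise the soft limit for this shell (it cannot exceed the hard limit,
# shown by `ulimit -Hn`; the kernel-wide maximum may also need tuning):
# ulimit -n 45000
```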
For anyone unfamiliar, Unix sort commands limit memory use on large files by sorting small sections in memory, saving each sorted run to a temporary file, and merging the runs. If the input is huge, there are many temp files, and writing and merging them becomes very I/O intensive. You can mitigate this to some extent with sort --buffer-size, telling it to use larger in-memory runs and hence fewer temp files.
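For example, with a toy input file (--buffer-size and -T are GNU sort options; on FreeBSD the GNU version is typically installed as gsort, which is the process name in the top output below):

```shell
# Toy input; with real data this could be billions of lines.
printf '3\n1\n2\n' > demo.txt
# A larger --buffer-size means longer in-memory sorted runs and therefore
# fewer temporary files to merge; -T puts those temp files on fast scratch.
sort --buffer-size=64M -T /tmp demo.txt > demo-sorted.txt
cat demo-sorted.txt
```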
I'm not sure a better strategy exists for files with relatively few calls and more than a few thousand samples.
One other thing I noticed but forgot to mention: running on a VCF with 10,000 samples, the perl script was actually CPU-bound during the first step (separating the fixed fields and reshaping the samples into a linear, one-sample-per-line format). Perl CPU time is the bottleneck there, while disk I/O is minimal. I haven't profiled the split step, but I suspect it's fine, since it's simple and writes one file at a time.
FreeBSD barracuda.uits bacon ~/Barracuda/TOPMed/phased 1005: top
  PID USERNAME  THR PRI NICE  SIZE   RES STATE  C   TIME   WCPU COMMAND
12053 bacon       1 103    0   39M   28M CPU0   0 559:53 99.80% perl
12055 bacon       1  23    0   53M   40M pipdwt 1  43:25  7.62% bcftools
12054 bacon       1  20    0   27M   11M piperd 0   1:25  0.33% gsort
17792 bacon       1  20    0   13M 3356K CPU1   1   0:00  0.04% top
FreeBSD barracuda.uits bacon ~/Barracuda/TOPMed/phased 1006: zpool iostat 1
              capacity     operations     bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
zroot       9.41T  1.47T     42     33   298K   347K
zroot       9.41T  1.47T      0      0      0      0
zroot       9.41T  1.47T      0      0      0      0
zroot       9.41T  1.47T      0    671      0  8.45M
zroot       9.41T  1.47T      0      0   103K      0
zroot       9.41T  1.47T      0      0      0      0
zroot       9.41T  1.47T      0      0      0      0
You might be able to speed it up by reimplementing it in C; biolibc and libxtend can help with this. I may add it as an alternative in vcf-split (with proper credit to Jorge) so users can pick the approach that best suits their data.
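To make the CPU-bound first step concrete, here is a toy sketch of the reshape in awk (an illustration only, not the actual perl code or the hypothetical C port): each wide VCF body line is fanned out to one line per sample, which can then be sorted or split per sample. Column positions assume a standard VCF body, with fixed fields in columns 1-9 and samples from column 10 on.

```shell
# Two-sample toy VCF body line (tab-separated, header omitted for brevity).
printf 'chr1\t100\t.\tA\tG\t.\tPASS\t.\tGT\t0|1\t1|1\n' > body.txt
# Fan each site out to one line per sample:
# sample-index, CHROM, POS, REF, ALT, genotype.
awk -F'\t' '{ for (i = 10; i <= NF; i++) print i - 9, $1, $2, $4, $5, $i }' \
    OFS='\t' body.txt > linear.txt
cat linear.txt
```

In this long format, a single external sort on the sample-index column groups each sample's calls together, so the split step can write one output file at a time.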
The code I posted is from the suggested thread, and it does not work; also, my files were not generated using GATK.