HI,
We have a small local server which I started running vg autoindex
on a week ago and it's still running. To me it's likely to just keep going ad infinitum as it doesn't seem to be pruning very efficiently. It's running inside the provided docker container on OSX with the genome directory mounted as /genome
. I've created a folder within that directory (on the server) called tmp
for all temp output to be written to. The command given was:
vg autoindex -t 24 -w mpmap -w rpvg \
-M 240G \
-T /genome/tmp \
-p /genome/full/hprc-v1.1-mc-grch38-gencode45 \
-x /genome/gencode.v45.primary_assembly.annotation.gtf \
-r /genome/GRCh38.primary_assembly.genome.fa \
-v /genome/hprc-v1.1-mc-grch38.vcfbub.a100k.wave.vcf.gz \
-V 2 &> /genome/full/complete_build.log
We've just hit seven days which really seems excessive to me, and all files have been created except the gcsa
files
$ ll full
total 32457336
-rw-r--r-- 1 stevieped wgs 1.2M 22 May 10:24 complete_build.log
-rw-r--r-- 1 stevieped wgs 701M 15 May 16:39 hprc-v1.1-mc-grch38-gencode45.haplotx.gbwt
-rw-r--r-- 1 stevieped wgs 5.3G 15 May 17:42 hprc-v1.1-mc-grch38-gencode45.spliced.dist
-rw-r--r-- 1 stevieped wgs 0B 20 May 17:28 hprc-v1.1-mc-grch38-gencode45.spliced.gcsa
-rw-r--r-- 1 stevieped wgs 0B 20 May 17:28 hprc-v1.1-mc-grch38-gencode45.spliced.gcsa.lcp
-rw-r--r-- 1 stevieped wgs 9.0G 15 May 17:07 hprc-v1.1-mc-grch38-gencode45.spliced.xg
-rw-r--r-- 1 stevieped wgs 442M 15 May 16:39 hprc-v1.1-mc-grch38-gencode45.txorigin.tsv
The section that seems stuck on an infinite loop in what appears in the log file as
[IndexRegistry]: Constructing GCSA/LCP indexes.
[IndexRegistry]: Forked into child process with PID 9007.
[IndexRegistry]: Forked into child process with PID 9008.
InputGraph::InputGraph(): 8676064820 kmers in 48 file(s)
InputGraph::readKeys(): 1721650899 unique keys
InputGraph::readFrom(): 7331206567 unique start nodes
GCSA::GCSA(): Prefix-doubling from path length 16
GCSA::GCSA(): Step 1 (path length 16 -> 32)
GCSA::GCSA(): Step 2 (path length 32 -> 64)
GCSA::GCSA(): Step 3 (path length 64 -> 128)
GCSA::GCSA(): Step 4 (path length 128 -> 256)
This has appeared previously twice in the log file as
[IndexRegistry]: Constructing GCSA/LCP indexes.
[IndexRegistry]: Forked into child process with PID 48452.
[IndexRegistry]: Forked into child process with PID 48453.
InputGraph::InputGraph(): 8640165890 kmers in 48 file(s)
InputGraph::readKeys(): 1722608560 unique keys
InputGraph::readFrom(): 7331277927 unique start nodes
GCSA::GCSA(): Prefix-doubling from path length 16
GCSA::GCSA(): Step 1 (path length 16 -> 32)
GCSA::GCSA(): Step 2 (path length 32 -> 64)
GCSA::GCSA(): Step 3 (path length 64 -> 128)
GCSA::GCSA(): Step 4 (path length 128 -> 256)
GCSA::GCSA(): Merging the paths
MergedGraph::MergedGraph(): Size limit exceeded, construction aborted
[IndexRegistry]: Exceeded disk or memory use limit while performing k-mer doubling steps. Rewinding to pruning step with more aggressive pruning to simplify the graph.
[IndexRegistry]: Pruning complex regions of spliced VG to prepare for GCSA indexing with GBWT unfolding.
and above that was the first time it appeared
[IndexRegistry]: Constructing GCSA/LCP indexes.
[IndexRegistry]: Forked into child process with PID 1264.
[IndexRegistry]: Forked into child process with PID 1265.
InputGraph::InputGraph(): 8582023968 kmers in 48 file(s)
InputGraph::readKeys(): 1723923355 unique keys
InputGraph::readFrom(): 7331378415 unique start nodes
GCSA::GCSA(): Prefix-doubling from path length 16
GCSA::GCSA(): Step 1 (path length 16 -> 32)
GCSA::GCSA(): Step 2 (path length 32 -> 64)
GCSA::GCSA(): Step 3 (path length 64 -> 128)
GCSA::GCSA(): Step 4 (path length 128 -> 256)
PathGraphBuilder::write(): Memory use of file 0 of kmer paths (240.0 GB) exceeds memory limit (240.0 GB)
PathGraphBuilder::write(): Memory use of file 0 of kmer paths (240.1 GB) exceeds memory limit (240.0 GB)
[IndexRegistry]: Exceeded disk or memory use limit while performing k-mer doubling steps. Rewinding to pruning step with more aggressive pruning to simplify the graph.
[IndexRegistry]: Pruning complex regions of spliced VG to prepare for GCSA indexing with GBWT unfolding.
This suggests to me there may be problems with resource allocation and the internal estimation vg
is performing. There are no limits on the HDD space (until 3.8TB) and the OSX docker interface has been set carefully to allow the correct resources for the process. Swap seems to be being set dynamically by the process and is oscillating between using ~13Gb of 15Gb and using 30Gb of 32Gb, and everywhere in between. None of us really understand what's happening there.
We need the server for other work but don't want to shut the process down at this point. This >7 day duration feels concerning and are anxious that this step is just going to keep repeating over & over. Testing on individual chromosomes ran to completion within hours so were unprepared for losing our server for days or even weeks.
Thanks for this. As a follow-up, the process did indeed complete after 8.5 days so I guess my concerns were unwarranted in the end.