Hi,
I am attempting to create a VG index against the human pangenome draft using vg autoindex. Here is the command:
vg autoindex --gfa hprc-v1.0-mc-grch38-minaf.0.1.gfa --tmp-dir /home/ec2-user/pangenome/tmp
vg has been running for about a week now, and I've seen the following sequence in the logs 12 times so far:
[IndexRegistry]: Pruning complex regions of VG to prepare for GCSA indexing.
[IndexRegistry]: Constructing GCSA/LCP indexes.
PathGraphBuilder::write(): Size limit exceeded, construction aborted
warning:[IndexRegistry] Child process 66427 failed with status 256 representing exit code 1
[IndexRegistry]: Exceeded disk use limit while performing k-mer doubling steps. Rewinding to pruning step with more aggressive pruning to simplify the graph.
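To track how many prune-and-retry cycles have happened without counting by hand, something like the following could be run against a saved copy of the log. This is only a rough sketch: the substrings are copied from the messages above, while the function name `count_events` and the log file name in the usage comment are made up for illustration.

```python
# Substrings taken from the log messages quoted above.
PATTERNS = {
    "pruning_pass": "Pruning complex regions of VG",
    "gcsa_attempt": "Constructing GCSA/LCP indexes",
    "size_limit_abort": "Size limit exceeded, construction aborted",
    "rewind": "Rewinding to pruning step",
}


def count_events(lines):
    """Count how many lines contain each of the known log messages."""
    counts = {name: 0 for name in PATTERNS}
    for line in lines:
        for name, needle in PATTERNS.items():
            if needle in line:
                counts[name] += 1
    return counts


# Example usage (hypothetical file name; assumes stderr was redirected to it):
# with open("autoindex.log") as fh:
#     print(count_events(fh))
```

A "rewind" count that keeps growing while the index files stay empty would suggest that each round of more aggressive pruning is still hitting the size limit.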
Over 2TB of disk space and just under 1TB of RAM are available on the machine vg is running on.
The xg index appears to have built successfully. The .gcsa and .gcsa.lcp files are both zero bytes in size.
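For anyone wanting to check the same thing on their own run, here is a minimal sketch that reports the size of each expected output file. The prefix `index` is an assumption (I did not pass an explicit output prefix), and `report_outputs` is just an illustrative helper name.

```python
from pathlib import Path

# File suffixes mentioned in this post; adjust if your run writes others.
SUFFIXES = (".xg", ".gcsa", ".gcsa.lcp")


def report_outputs(prefix, suffixes=SUFFIXES):
    """Return {suffix: size in bytes}, with None for files that do not exist."""
    sizes = {}
    for suffix in suffixes:
        path = Path(f"{prefix}{suffix}")
        sizes[suffix] = path.stat().st_size if path.is_file() else None
    return sizes


if __name__ == "__main__":
    # "index" is an assumed output prefix, not confirmed by the logs above.
    for suffix, size in report_outputs("index").items():
        if size is None:
            status = "missing"
        elif size == 0:
            status = "empty (0 bytes)"
        else:
            status = f"{size} bytes"
        print(f"index{suffix}: {status}")
```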
Ultimately, I'd like to map a small number of short sequences (as short as 20 nt) to the pangenome, and I am particularly interested in structural variants. The index distributed with the human pangenome draft appears to be built for vg giraffe, which does not appear to support sequences this short.
Any pointers on how to build the index more efficiently or other ways of mapping these short sequences would be appreciated.
Thanks!
FYI, the process was eventually killed after about 10 days. There was nothing new in the logs, and the .gcsa and .gcsa.lcp files are still zero bytes.
Hi Lisle, I've replicated this behavior locally and will look into the cause.