I tried to run vg autoindex --workflow mpmap using GRCh38, GENCODE 47, and the HPRC decomposed VCF. I renamed the chromosomes in the VCF so that they match the chromosome names in the GENCODE annotation.
It always fails while building the GCSA index.
How can I pick up from the spliced graph that was already built? Should I prune it and then index it?
vg autoindex --workflow mpmap --workflow rpvg \
--prefix hprc-v1.1-mc-grch38-gencode47 \
--ref-fasta GRCh38.primary_assembly.genome.fa \
--tx-gff gencode.v47.primary_assembly.annotation.gtf \
--vcf hprc-v1.1-mc-grch38.vcfbub.a100k.wave.gencode.vcf.gz \
--threads 32 \
--target-mem 400G \
--tmp-dir tmp
I am also curious which parameters vg autoindex uses for these steps.
Thanks for your guidance.
I got signal 9 twice. It could be that someone else was running memory-intensive tasks during my GCSA build. I am using a shared server without a workload manager like SLURM.
It is a good idea to request a dedicated server for a few weeks. The build will take at least a week. See this Biostars post
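For anyone else hitting this: signal 9 is SIGKILL, which is also what the Linux out-of-memory killer sends, and a shell reports it as exit status 137 (128 + 9). Below is a minimal sketch of a wrapper that makes this visible in a log. It assumes a POSIX shell; run_and_report is a hypothetical helper name, not part of vg.

```shell
# Hypothetical helper (not part of vg): run a command and report whether it
# exited normally or was terminated by a signal.
run_and_report() {
    "$@"
    status=$?
    if [ "$status" -gt 128 ]; then
        # Shells report death by signal as 128 + the signal number.
        # (A program can also choose to exit with such a code itself,
        # so treat this as a strong hint rather than proof.)
        echo "terminated by signal $((status - 128))"
    else
        echo "exit status $status"
    fi
    return "$status"
}
```

Prefixing the long-running command with run_and_report prints "terminated by signal 9" when the process was killed. On Linux, the kernel log (dmesg) usually has a matching out-of-memory entry if memory was the cause, though reading it may need elevated permissions.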
It would be nice if vg could save the prune arguments and then continue after an interrupt. Otherwise, it will just keep repeating the previous prune and GCSA steps.
GCSA indexing certainly can be time consuming. On a reasonably well-behaved graph (such as those derived from VCFs, or usually those from the Minigraph-Cactus tool), it typically takes ~1-2 days. In my experience, the memory use tops out at 100-200 GB, but the temporary disk usage can be higher. It will depend to some extent on the properties of the genome. In any case, this is not a step that's easy to perform in a resource-constrained compute environment.
I increased the temp dir size limit to 4 TB by adding
--gcsa-size-limit 4398046511104
The temp dir peaked at around 3 TB. My previous attempts ran and then got stuck because there wasn't enough space. In this run, it tried pruning twice.
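A quick sanity check on the numbers above: 4398046511104 is 4 × 1024^4 bytes, i.e. 4 TiB. The sketch below recomputes that value and checks whether a directory has at least a given amount of free space, which may be worth doing before starting a multi-day build. It assumes a POSIX shell with df and awk available; has_free_kib is a hypothetical helper name, not part of vg.

```shell
# 4 TiB expressed in bytes, the value passed to --gcsa-size-limit above.
limit_bytes=$((4 * 1024 * 1024 * 1024 * 1024))
echo "$limit_bytes"

# Hypothetical helper: succeed if DIR has at least NEED_KIB KiB available.
has_free_kib() {
    dir=$1
    need_kib=$2
    # df -P prints one POSIX-format line per file system; -k reports KiB.
    # Column 4 of the second line is the available space.
    avail_kib=$(df -Pk "$dir" | awk 'NR == 2 { print $4 }')
    [ "$avail_kib" -ge "$need_kib" ]
}
```

For example, `has_free_kib tmp $((limit_bytes / 1024)) || echo "tmp has less than 4 TiB free"` would flag the situation from my earlier attempts before the job starts, rather than days into it. On a shared server the free space can still shrink while the build runs, so this is only a pre-flight check.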
Can you add this additional info to the vg wiki? I think it will help more people.
Also, thanks for this post 9595443