What does vg autoindex mpmap do?
13 days ago
Ziyue • 0

I tried to run vg autoindex --workflow mpmap using GRCh38, GENCODE 47, and the HPRC decomposed VCF. I renamed the chromosomes in the VCF so that they match the chromosome names in the GENCODE annotation.

It always fails while building the GCSA index.

How can I take over manually, starting from the spliced graph? Should I prune it and then index it?

vg autoindex --workflow mpmap --workflow rpvg \
  --prefix hprc-v1.1-mc-grch38-gencode47 \
  --ref-fasta GRCh38.primary_assembly.genome.fa \
  --tx-gff gencode.v47.primary_assembly.annotation.gtf \
  --vcf hprc-v1.1-mc-grch38.vcfbub.a100k.wave.gencode.vcf.gz \
  --threads 32 \
  --target-mem 400G \
  --tmp-dir tmp

I am also curious which parameters vg autoindex uses.

vg mpmap autoindex • 413 views
13 days ago

It's possible to replicate the indexing pipeline in vg autoindex -w mpmap by using vg construct to make a graph, vg rna to add splice junctions, vg index to create an XG from the spliced graph, vg prune to create a simplified graph for GCSA indexing, and vg index to construct the GCSA for the simplified graph. It's a fairly complicated pipeline with a number of minor pitfalls, which was part of the motivation for developing vg autoindex in the first place. The default parameters aren't exposed at the command line, but this is where they are in the code, if you want to try to replicate the pipeline manually.
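As a rough sketch of the steps listed above, the manual sequence might look like the script below. It is not the exact pipeline that vg autoindex runs: the file names are placeholders, the flags are plain defaults rather than the tuned parameters in the code, and it leaves out anything autoindex does beyond the five steps named here. The `run` wrapper and the `DRY_RUN` switch are hypothetical helpers so the commands can be inspected before anything is executed.

```shell
#!/bin/sh
# Dry-run sketch of the manual steps named above. All file names are
# placeholders (assumptions), and the flags are plain defaults, NOT the
# tuned parameters that vg autoindex uses internally.
DRY_RUN=${DRY_RUN:-1}          # set DRY_RUN=0 to actually execute
REF=${REF:-ref.fa}
VCF=${VCF:-variants.vcf.gz}
GTF=${GTF:-transcripts.gtf}
TMP_DIR=${TMP_DIR:-tmp}
THREADS=${THREADS:-32}

# Print a command line; run it through sh only when DRY_RUN=0.
# Note: paths containing spaces are not handled by this simple wrapper.
run() {
    echo "+ $1"
    if [ "$DRY_RUN" = "0" ]; then
        sh -c "$1" || exit 1
    fi
}

manual_pipeline() {
    # 1. Build the variation graph from the reference and the VCF.
    run "vg construct -r $REF -v $VCF -t $THREADS > graph.vg"
    # 2. Add splice junctions from the transcript annotation.
    run "vg rna -n $GTF -t $THREADS graph.vg > spliced.vg"
    # 3. Build the XG index from the spliced graph.
    run "vg index -x spliced.xg spliced.vg"
    # 4. Prune complex regions to get a simplified graph for GCSA indexing.
    run "vg prune -r -t $THREADS spliced.vg > pruned.vg"
    # 5. Build the GCSA index from the pruned graph, temp files in TMP_DIR.
    run "vg index -g spliced.gcsa -b $TMP_DIR -t $THREADS pruned.vg"
}

manual_pipeline
```

Running it as-is only prints the five commands. If the GCSA step fails, steps 4 and 5 are the ones to repeat with more aggressive pruning settings, which is what the rewind mechanism is meant to automate.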

There's supposed to be some fault-tolerance built into the GCSA indexing step, which has the potential to require exponential time and space: if the index grows too large, it's supposed to rewind to the pruning step and then repeat it with more aggressive parameters. Were you seeing failures that escaped this rewind mechanism?


Thanks for the explanation.

I got signal 9 twice. Someone else may have been running memory-intensive tasks during my GCSA build. I am using a shared server without a workload manager such as SLURM.
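For anyone checking the same thing: a process killed by signal 9 (SIGKILL) is reported by the shell as exit status 137, that is 128 + 9. The helper below is a minimal sketch for decoding an exit status; `describe_exit` is a hypothetical name, not part of vg. Whether the kill came from memory pressure has to be confirmed separately, for example from the system logs.

```shell
# describe_exit STATUS: explain a shell exit status.
# Hypothetical helper; statuses above 128 conventionally mean
# "terminated by signal (STATUS - 128)".
describe_exit() {
    status=$1
    if [ "$status" -eq 0 ]; then
        echo "success"
    elif [ "$status" -gt 128 ]; then
        echo "killed by signal $((status - 128))"
    else
        echo "exited with status $status"
    fi
}

# Typical use right after a long-running command:
#   vg autoindex ... ; describe_exit $?
describe_exit 137   # prints: killed by signal 9
```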

It is a good idea to request a dedicated server for a few weeks. It will take at least a week. See this Biostars post.

It would be nice if vg could save the prune arguments and then continue after an interrupt. Otherwise, it will just keep repeating the previous prune and GCSA steps.


GCSA indexing certainly can be time consuming. On a reasonably well-behaved graph (such as those derived from VCFs, or usually those from the Minigraph-Cactus tool), it typically takes ~1-2 days. In my experience, the memory use tops out at 100-200 GB, but the temporary disk usage can be higher. It will depend to some extent on the properties of the genome. In any case, this is not a step that's easy to perform in a resource-constrained compute environment.


I increased the temp dir size limit to 4 TiB by adding --gcsa-size-limit 4398046511104 (the value is in bytes).
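The number above is 4 × 1024^4 bytes. A small helper for computing such values, as a sketch (`tib_to_bytes` is a hypothetical name, not a vg command):

```shell
# tib_to_bytes N: convert N tebibytes to bytes (1 TiB = 1024^4 bytes).
tib_to_bytes() {
    echo $(( $1 * 1024 * 1024 * 1024 * 1024 ))
}

tib_to_bytes 4   # prints: 4398046511104
```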

Elapsed: 2 days, 19 hours, 18 minutes, 21 seconds

The temp dir peaked at around 3 TB. My previous attempts ran and then got stuck because there wasn't enough space. In this run, it tried pruning twice.

Could you add this additional information to the vg wiki? I think it would help more people.
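Since running out of temp space was the cause, a pre-flight check of the temp dir before starting a multi-day build may save some time. The following is a minimal sketch using `df -Pk`; the function names are hypothetical, and the 4 TiB threshold simply mirrors the limit I used for this graph, so other graphs may need a different number.

```shell
# available_kb DIR: print the free space, in KiB, on the filesystem
# that holds DIR (column 4 of POSIX-format df output).
available_kb() {
    df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# has_space DIR NEEDED_KB: succeed if DIR has at least NEEDED_KB KiB free.
has_space() {
    [ "$(available_kb "$1")" -ge "$2" ]
}

TMP_DIR=${TMP_DIR:-.}                    # placeholder; point at the real temp dir
NEEDED_KB=$((4 * 1024 * 1024 * 1024))    # 4 TiB expressed in KiB

if has_space "$TMP_DIR" "$NEEDED_KB"; then
    echo "enough temp space in $TMP_DIR"
else
    echo "less than 4 TiB free in $TMP_DIR"
fi
```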

Also, thanks for this post 9595443
