Question

Indexing the human pangenome draft

20 months ago

lisle.mose ▴ 20

Hi,

I am attempting to create a VG index against the human pangenome draft using vg autoindex. Here is the command:

vg autoindex --gfa hprc-v1.0-mc-grch38-minaf.0.1.gfa --tmp-dir /home/ec2-user/pangenome/tmp

vg has been running for about a week now and I've seen the following in the logs 12 times so far:

[IndexRegistry]: Pruning complex regions of VG to prepare for GCSA indexing.
[IndexRegistry]: Constructing GCSA/LCP indexes.
PathGraphBuilder::write(): Size limit exceeded, construction aborted
warning:[IndexRegistry] Child process 66427 failed with status 256 representing exit code 1
[IndexRegistry]: Exceeded disk use limit while performing k-mer doubling steps. Rewinding to pruning step with more aggressive pruning to simplify the graph.

Over 2TB of disk space and just under 1TB of RAM are available on the machine vg is running on.

The xg index appears to have built successfully. The .gcsa and .gcsa.lcp files are both of size zero bytes.

Ultimately, I'd like to map a small number of short sequences (as short as 20 nt) to the pangenome, and I am particularly interested in structural variants. The index distributed with the human pangenome draft appears to be built for giraffe, which does not appear to support sequences this short.

Any pointers on how to build the index more efficiently or other ways of mapping these short sequences would be appreciated.

Thanks!

Pangenome VG • 1.5k views

updated 20 months ago by Jordan M Eizenga ▴ 690 • written 20 months ago by lisle.mose ▴ 20

FYI, the process was eventually killed after about 10 days. Nothing new appeared in the logs, and the .gcsa and .gcsa.lcp files are still zero bytes.

20 months ago by lisle.mose ▴ 20

Hi Lisle, I've replicated this behavior locally and will look into the cause.

20 months ago by Jordan M Eizenga ▴ 690

score 2 · Answer 1 · 2023-07-20

20 months ago

Jordan M Eizenga ▴ 690

I've figured out what's going on, and as a hot fix you can remove all of the "W" lines from the GFA like this:

grep -v "^W" hprc-v1.0-mc-grch38-minaf.0.1.gfa > hprc-v1.0-mc-grch38-minaf.0.1.no_w.gfa

That GFA should be indexable. Sometime soon, I'll update the logic to replicate this behavior even without removing the W lines.
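
Before re-running vg autoindex on the filtered file, it may be worth confirming that the filter dropped only the W lines and left every other record untouched. The sketch below is not part of the original answer: it applies the same grep filter to a tiny made-up GFA (the file names and records are hypothetical, for illustration only) and compares line counts before and after. On the real graph you would point in_gfa and out_gfa at the HPRC file and its filtered copy instead of generating the toy input.

```shell
# Toy GFA for illustration only (hypothetical content, not taken from the HPRC graph).
tmpdir=$(mktemp -d)
in_gfa="$tmpdir/toy.gfa"
out_gfa="$tmpdir/toy.no_w.gfa"

printf 'H\tVN:Z:1.1\n'                           >  "$in_gfa"
printf 'S\t1\tACGT\n'                            >> "$in_gfa"
printf 'S\t2\tTTGA\n'                            >> "$in_gfa"
printf 'L\t1\t+\t2\t+\t0M\n'                     >> "$in_gfa"
printf 'W\tsample1\t1\tchr1\t0\t8\t>1>2\n'       >> "$in_gfa"

# Same filter as in the answer: drop every line that starts with W.
grep -v "^W" "$in_gfa" > "$out_gfa"

# grep -c exits non-zero when the count is 0, so "|| true" keeps the script going.
w_before=$(grep -c "^W" "$in_gfa" || true)
w_after=$(grep -c "^W" "$out_gfa" || true)
other_before=$(grep -vc "^W" "$in_gfa" || true)
other_after=$(wc -l < "$out_gfa" | tr -d ' ')

echo "W lines: $w_before -> $w_after; other lines: $other_before -> $other_after"
```

If the W count drops to zero and the count of other lines is unchanged, the filtered GFA can be passed to the same vg autoindex command from the question in place of the original file.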