I'm encountering an error when converting from .gfa to .gbz, during the build of a GBWT/GBZ index from the HPRC v1.0 pangenome graph constructed with PGGB. The error occurs during the GBWT construction phase when trying to merge different "sources" (partitions).
Command and error
vg gbwt -G hprc-v1.0-pggb.gfa --gbz-format -g hprc-v1.0-pggb.gbz
GBWT::GBWT(): Sources 0 and 1 both have node 321364
- vg version: v1.61.0
Analysis
This error message comes from line 329 here: https://github.com/jltsiren/gbwt/blob/master/src/gbwt.cpp
I'm not exactly sure what this means. In the GFA file, node sharing is extremely common in this pangenome: nodes appear in many different paths.
Looking at the GBWT code (particularly the constructor in gbwt.cpp) and the wiki page, my guess is that during construction from GFA, the algorithm:
- Partitions the graph (union-find) into weakly connected components (component = a subgraph where all nodes are connected to each other). When assigning paths to jobs, it uses the first segment of each path to determine which job/component it belongs to (but later nodes in those paths may belong to different components).
- When building each job's GBWT, it traverses the entire path including all nodes, not just the nodes in that component.
- Processes each component as a separate "source"
- Assumes during merging that each node appears in exactly one source. The constructor explicitly checks for this, and when the check fails, it results in the error.
In other words, the GBWT construction algorithm seems to expect each node to be unique to a single partition/source, but in this pangenome graph, nodes are naturally shared across many paths representing different samples/haplotypes.
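To make my guess concrete, here is a toy Python model of it. This is only a sketch of the behavior I am hypothesizing, not the actual GBWT/GBWTGraph code; the function names (find, components, find_conflict) and the tiny example graph are made up for illustration.

```python
def find(parent, x):
    # Union-find lookup with path halving.
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def components(nodes, edges):
    # Map each node to the representative of its weakly connected component.
    parent = {n: n for n in nodes}
    for a, b in edges:
        ra, rb = find(parent, a), find(parent, b)
        if ra != rb:
            parent[rb] = ra
    return {n: find(parent, n) for n in nodes}

def find_conflict(paths, comp):
    # Assign each path to a job by the component of its FIRST node,
    # but record every node the path visits (as in my guess above).
    jobs = {}
    for path in paths:
        jobs.setdefault(comp[path[0]], set()).update(path)
    # Merging assumes that each node belongs to exactly one job/source.
    owner = {}
    for job, visited in jobs.items():
        for node in visited:
            if node in owner and owner[node] != job:
                return (owner[node], job, node)
            owner[node] = job
    return None

# Hypothetical graph with two components: {1, 2} and {3, 4}.
comp = components([1, 2, 3, 4], [(1, 2), (3, 4)])

# Two paths sharing nodes inside one component: no conflict.
print(find_conflict([[1, 2], [2, 1]], comp))
# A path that starts in one component and visits a node of another: conflict.
print(find_conflict([[1, 2], [3, 4, 2]], comp))
```

In this toy model, sharing a node between paths is harmless as long as the paths stay inside one component. A conflict only appears when a path visits a node that the edges place in a different component, which is what makes me suspect the input rather than node sharing itself.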
How do I construct a .gbz file out of this .gfa file?
I also am likely misunderstanding something here: isn't it fairly natural for many nodes to be shared across many paths?
It looks like this command works without an error:
../vg gbwt -G ../hprc-v1.1-mc-grch38.gfa --gbz-format -g ../hprc-v1.1-mc-grch38.gbz
Any guidance would be greatly appreciated!
Interestingly, I redownloaded the file and got a different error.
Before re-downloading, I saw lots of print outputs referring to jobs of various sizes before crashing on the merge step. After redownloading:
This is with the exact same command as before. I downloaded https://s3-us-west-2.amazonaws.com/human-pangenomics/pangenomes/freeze/freeze1/pggb/hprc-v1.0-pggb.gfa.gz, ran gunzip, then ran the same vg gbwt command.
The HPRC v1.0 PGGB graph is an old graph, from before path naming conventions were fully established (see the vg wiki).
The full name of the offending path is HG00438#2#JAHBCA010000258.1#MT, which does not match any of the patterns we expect. The only name format with three # separators should have an integer as the last field. After a few failed parsing attempts, the name matches the (slightly too general) PanSN regex, with HG00438#2 as the sample name, JAHBCA010000258.1 as the haplotype number, and MT as the contig/sequence name. And then it fails, because the haplotype number is not a number.
Someone had probably renamed the paths in the old GFA file to match the current conventions. You need to do it again to be able to parse the GFA and build the GBZ. Or, if you compiled vg from source, you could go to deps/gbwtgraph/src/gfa.cpp and change GFAParsingParameters::PAN_SN_REGEX to require that the middle field is an integer.
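As a rough illustration of the parsing behavior described above, the following Python sketch compares a loose three-field pattern with one that requires an integer middle field, and lists P-line path names that would trip the loose pattern. Both regexes are simplified stand-ins chosen for this example, not the actual value of PAN_SN_REGEX in gbwtgraph, and the helper names are hypothetical.

```python
import re

# Assumed stand-ins, NOT the real gbwtgraph regexes.
LOOSE_PAN_SN = re.compile(r"(.*)#(.*)#(.*)")         # any middle field
STRICT_PAN_SN = re.compile(r"(.*)#([0-9]+)#(.*)")    # integer middle field

def split_name(name, pattern):
    # Return (sample, haplotype, contig) or None if the name does not match.
    m = pattern.fullmatch(name)
    return m.groups() if m else None

def bad_path_names(gfa_lines):
    # Yield P-line path names whose loosely parsed haplotype is not an integer.
    for line in gfa_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2 or fields[0] != "P":
            continue
        parts = split_name(fields[1], LOOSE_PAN_SN)
        if parts is not None and re.fullmatch(r"[0-9]+", parts[1]) is None:
            yield fields[1]

name = "HG00438#2#JAHBCA010000258.1#MT"
print(split_name(name, LOOSE_PAN_SN))   # greedy match pushes "#2" into the sample
print(split_name(name, STRICT_PAN_SN))  # haplotype must be digits

# Made-up GFA lines for demonstration only.
example = [
    "S\t1\tACGT\n",
    "P\tHG00438#1#JAHBCB010000001.1\t1+\t*\n",
    "P\tHG00438#2#JAHBCA010000258.1#MT\t1+\t*\n",
]
print(list(bad_path_names(example)))
```

To scan a real file, pass an open file handle to bad_path_names instead of the example list. Only P lines are checked here; W lines carry their metadata in separate fields and are ignored by this sketch.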