GBWT Construction Error: "Sources X and Y both have node Z" When Building GBZ from GFA
1 day ago
Sauers ▴ 10

I'm getting an error when converting a .gfa file to .gbz, that is, when building a GBWT/GBZ index from the HPRC v1.0 pangenome graph constructed with PGGB. The error occurs in the GBWT construction phase, when the different "sources" (partitions) are merged.

Command and error

vg gbwt -G hprc-v1.0-pggb.gfa --gbz-format -g hprc-v1.0-pggb.gbz
GBWT::GBWT(): Sources 0 and 1 both have node 321364
  • vg version: v1.61.0

Analysis

This error message comes from line 329 here: https://github.com/jltsiren/gbwt/blob/master/src/gbwt.cpp

I'm not exactly sure what this means. In the GFA file, node sharing is extremely common in this pangenome: nodes appear in many different paths.

Looking at the GBWT code (particularly the constructor in gbwt.cpp and from the wiki page), my guess is that during construction from GFA, the algorithm:

  1. Partitions the graph (union-find) into weakly connected components (component = a subgraph where all nodes are connected to each other). When assigning paths to jobs, it uses the first segment of each path to determine which job/component it belongs to (but later nodes in those paths may belong to different components).
  2. When building each job's GBWT, it traverses the entire path including all nodes, not just the nodes in that component.
  3. Processes each component as a separate "source"
  4. Assumes during merging that each node appears in exactly one source, and when the constructor explicitly checks for this, it results in the error.

In other words, the GBWT construction algorithm seems to expect each node to be unique to a single partition/source, but in this pangenome graph, nodes are naturally shared across many paths representing different samples/haplotypes.

How do I construct a .gbz file out of this .gfa file?

I also am likely misunderstanding something here: isn't it fairly natural for many nodes to be shared across many paths?

It looks like this command works without an error:

../vg gbwt -G ../hprc-v1.1-mc-grch38.gfa --gbz-format -g ../hprc-v1.1-mc-grch38.gbz

Any guidance would be greatly appreciated!

1 day ago
Jouni Sirén ▴ 580

GBWT construction partitions the graph into weakly connected components. Then it forms construction jobs that consist of one or more components and assigns each path to a job according to its first node. If the input GFA is consistent, with the paths going from one node to another only if there is an edge between them, the construction jobs will be completely non-overlapping. That is because all nodes in a path will then be from the same component. We can then build GBWTs for each job independently and merge them quickly to get the final GBWT.

On the other hand, if the GFA is not consistent, the merge will fail with an error message like you had. This can plausibly be caused by a single bit flip that changes one node identifier in one path, either in the file itself or when it is cached in memory.
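To look for such an inconsistency outside of vg, something like the following Python sketch could help. It is not the gbwtgraph implementation: it assumes a GFA with tab-separated S, L, and P lines, builds weakly connected components with a simple union-find, and reports paths whose segments fall into more than one component. The function names are made up for this example. The check is weaker than verifying that every step of a path follows an edge, and a dictionary-based union-find will need a lot of memory on a graph with over 100 million segments.

```python
def find(parent, x):
    # Find the representative of x's component, compressing the path.
    root = x
    while parent[root] != root:
        root = parent[root]
    while parent[x] != root:
        parent[x], x = root, parent[x]
    return root


def paths_spanning_components(gfa_lines):
    """Return names of P-line paths that touch more than one component."""
    parent = {}
    paths = []
    for line in gfa_lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] == "S":
            parent.setdefault(fields[1], fields[1])
        elif fields[0] == "L":
            # Orientation is ignored: components are weakly connected.
            for name in (fields[1], fields[3]):
                parent.setdefault(name, name)
            a, b = find(parent, fields[1]), find(parent, fields[3])
            if a != b:
                parent[a] = b
        elif fields[0] == "P":
            # Strip the trailing +/- orientation from each step.
            segments = [step[:-1] for step in fields[2].split(",")]
            paths.append((fields[1], segments))

    bad = []
    for name, segments in paths:
        for seg in segments:
            parent.setdefault(seg, seg)
        if len({find(parent, seg) for seg in segments}) > 1:
            bad.append(name)
    return bad
```

With a real file, the lines can be streamed with `with open("graph.gfa") as f: print(paths_spanning_components(f))`. Any path it reports would be assigned to one job by its first node while visiting nodes that belong to another job.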

I recommend that you download the PGGB graph again and then run the vg gbwt command with option -p. If it still fails, include the entire output of the command in the error report.

Interestingly, I redownloaded the file and got a different error.

Before re-downloading, I saw lots of print outputs referring to jobs of various sizes before crashing on the merge step. After redownloading:

Building input GBWTs
Input type: GFA
Opening GFA file ../hprc-v1.0-pggb.gfa
Validating GFA file ../hprc-v1.0-pggb.gfa
Found 110884673 segments, 154756169 links, 34796 paths, and 0 walks in 416.696 seconds
Storing generic named paths as sample _gbwt_ref
GBWT insertion batch size: 241084980 nodes
Parsing segments
Breaking segments into 1024 bp nodes
Parsed 115456068 nodes in 203.57 seconds
Parsing links
Parsed 159327564 edges in 111.204 seconds
Creating jobs
Created 40 jobs for 1346 components in 301.6 seconds
Parsing metadata
terminate called after throwing an instance of 'std::runtime_error'
  what():  MetadataBuilder: Invalid haplotype field JAHBCA010000258.1

Crash report for vg v1.61.0 "Plodio"
Stack trace (most recent call last):
#16   Object "/home/group/user/graph/vg", at 0x623c34, in _start
#15   Object "/home/group/user/graph/vg", at 0x20e46d6, in __libc_start_main
#14   Object "/home/group/user/graph/vg", at 0x20e2e39, in __libc_start_call_main
#13   Object "/home/group/user/graph/vg", at 0xe1721b, in vg::subcommand::Subcommand::operator()(int, char**) const
#12   Object "/home/group/user/graph/vg", at 0xd263ed, in main_gbwt(int, char**)
#11   Object "/home/group/user/graph/vg", at 0xd241c7, in step_1_build_gbwts(vg::GBWTHandler&, GraphHandler&, GBWTConfig&)
#10   Object "/home/group/user/graph/vg", at 0x16c34e1, in gbwtgraph::gfa_to_gbwt(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, gbwtgraph::GFAParsingParameters const&)
#9    Object "/home/group/user/graph/vg", at 0x16b61a1, in gbwtgraph::parse_metadata(gbwtgraph::GFAFile const&, std::vector<gbwtgraph::ConstructionJob, std::allocator<gbwtgraph::ConstructionJob> > const&, gbwtgraph::MetadataBuilder&, gbwtgraph::GFAParsingParameters const&)
#8    Object "/home/group/user/graph/vg", at 0x16b2b27, in gbwtgraph::GFAFile::for_these_path_names(std::vector<char const*, std::allocator<char const*> > const&, std::function<void (std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&)> const&) const
#7    Object "/home/group/user/graph/vg", at 0x5ab2bf, in gbwtgraph::MetadataBuilder::add_path(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, unsigned long) [clone .cold]
#6    Object "/home/group/user/graph/vg", at 0x201cc58, in __cxa_throw
#5    Object "/home/group/user/graph/vg", at 0x201caf6, in std::terminate()
#4    Object "/home/group/user/graph/vg", at 0x201ca8b, in __cxxabiv1::__terminate(void (*)())
#3    Object "/home/group/user/graph/vg", at 0x5f006b, in __gnu_cxx::__verbose_terminate_handler() [clone .cold]
#2    Object "/home/group/user/graph/vg", at 0x5f27b3, in abort
#1    Object "/home/group/user/graph/vg", at 0x20fb8b5, in raise
#0    Object "/home/group/user/graph/vg", at 0x212840c, in __pthread_kill
ERROR: Signal 6 occurred. VG has crashed. Visit https://github.com/vgteam/vg/issues/new/choose to report a bug.
Please include this entire error log in your bug report!

This is with the exact same command as before. I downloaded https://s3-us-west-2.amazonaws.com/human-pangenomics/pangenomes/freeze/freeze1/pggb/hprc-v1.0-pggb.gfa.gz, ran gunzip, then ran the same vg gbwt command.

The HPRC v1.0 PGGB graph is an old graph, from before path naming conventions were fully established (see the vg wiki).

The full name of the offending path is HG00438#2#JAHBCA010000258.1#MT, which does not match any of the patterns we expect. The only name format with three # separators should have an integer as the last field. After a few failed parsing attempts, the name matches the (slightly too general) PanSN regex, with HG00438#2 as the sample name, JAHBCA010000258.1 as the haplotype number, and MT as the contig/sequence name. And then it fails, because the haplotype number is not a number.
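The failure mode can be reproduced in a few lines of Python. The two patterns below are assumptions written for illustration only, not the actual regexes in gbwtgraph: the loose one accepts anything between the last two # separators, while the strict one requires an integer there.

```python
import re

# Hypothetical patterns, not copied from gbwtgraph.
LOOSE_PAN_SN = re.compile(r"(.*)#([^#]*)#([^#]*)")
STRICT_PAN_SN = re.compile(r"(.*)#([0-9]+)#([^#]*)")


def parse_loose(name):
    """Split a name as sample#haplotype#contig using the loose pattern."""
    match = LOOSE_PAN_SN.fullmatch(name)
    if match is None:
        return None
    sample, haplotype, contig = match.groups()
    # The match succeeds first; the integer check only happens afterwards.
    if re.fullmatch(r"[0-9]+", haplotype) is None:
        raise ValueError(f"Invalid haplotype field {haplotype}")
    return sample, int(haplotype), contig


name = "HG00438#2#JAHBCA010000258.1#MT"
print(LOOSE_PAN_SN.fullmatch(name).groups())
print(STRICT_PAN_SN.fullmatch(name))
```

With the loose pattern, the name splits into HG00438#2, JAHBCA010000258.1, and MT, and the integer check then fails. With the strict pattern, the name does not match at all, so a parser could move on to other formats or report an unrecognized name.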

Someone had probably renamed the paths in the old GFA file to match the current conventions. You need to do it again to be able to parse the GFA and build the GBZ. Or, if you compiled vg from source, you could go to deps/gbwtgraph/src/gfa.cpp and change GFAParsingParameters::PAN_SN_REGEX to require that the middle field is an integer.
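Before renaming anything, it may help to list the path names that would hit this problem. The sketch below assumes the paths are stored as P lines and applies only the rule stated above: a name with three # separators should end in an integer. The function name is hypothetical, and the sketch does not decide what the new names should be.

```python
def suspicious_path_names(gfa_lines):
    """Return P-line names with three '#' separators and a non-integer last field."""
    names = []
    for line in gfa_lines:
        if not line.startswith("P\t"):
            continue
        name = line.split("\t", 2)[1]
        fields = name.split("#")
        # Four fields means three separators.
        if len(fields) == 4 and not fields[-1].isdigit():
            names.append(name)
    return names
```

Running it over the file, for example with `with open("hprc-v1.0-pggb.gfa") as f: print(suspicious_path_names(f))`, gives the list of names to fix, which can then be rewritten with whatever renaming scheme fits the current conventions.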
