A .gfa file is a text-based file that describes the structure of a pan-genome graph. I can write a script to parse it, but that is time-consuming because of the file's size.
However, VG also uses several other formats, for example .gbz, .vg, and .xg. These files are all binary, so I can't easily tell what information they contain or what can be extracted from them.
I am wondering if there is any way to get the source and sequence for a specific node/segment. The source might indicate which haplotype contains this node.
`vg convert` can convert those formats into GFA, and `vg chunk` can be used to query small graph regions. However, `vg chunk` loads the entire graph into memory for each query. This makes it fast enough for individual interactive queries, but too slow to be very effective as a backend to programmatic queries. There's development currently underway on a more responsive SQL-based query interface here.
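Once the graph is in GFA, one node can be looked up by streaming the file line by line, so the whole graph is never held in memory. The following is a minimal sketch, not part of vg: the function name `lookup_segment` is hypothetical, and it assumes the usual GFA layout in which `S` lines carry the segment name and sequence, `P` lines carry a comma-separated list of oriented steps, and `W` lines (GFA 1.1) carry a sample, haplotype index, sequence ID, and a walk string such as `>1<2`. Joining the `W` fields as `sample#haplotype#seqid` is only a labelling choice made here.

```python
import re
from typing import Iterable, Optional, Set, Tuple


def lookup_segment(lines: Iterable[str], node_id: str) -> Tuple[Optional[str], Set[str]]:
    """Return (sequence, sources) for one segment of a GFA file.

    `lines` can be an open file handle, so the file is read one line at a time.
    `sources` holds the names of the P-line paths and W-line walks that visit
    the segment (hypothetical labelling: sample#haplotype#seqid for walks).
    """
    sequence = None
    sources = set()
    # A GFA 1.1 walk is a run of oriented steps such as ">1<2>3".
    walk_step = re.compile(r"[<>]([^<>]+)")

    for line in lines:
        fields = line.rstrip("\n").split("\t")
        kind = fields[0]
        if kind == "S" and len(fields) >= 3 and fields[1] == node_id:
            sequence = fields[2]
        elif kind == "P" and len(fields) >= 3:
            # Steps look like "12+" or "12-"; drop the trailing orientation.
            steps = (step[:-1] for step in fields[2].split(","))
            if node_id in steps:
                sources.add(fields[1])
        elif kind == "W" and len(fields) >= 7:
            if node_id in walk_step.findall(fields[6]):
                sources.add(f"{fields[1]}#{fields[2]}#{fields[3]}")
    return sequence, sources
```

To use it, pass an open file handle, for example `with open("graph.gfa") as handle: seq, sources = lookup_segment(handle, "42")`, where the file name and node ID are placeholders. Each call is a full pass over the file, so for many lookups it would be better to collect all nodes of interest in a single pass.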