I have a collection of ~100 haploid assemblies, each about 4.5 megabases, that I would like to map reads to using giraffe. However, I am not completely clear on the best practices for constructing the graph from the assemblies. I have used PGGB to create a GFA with haplotype information, but according to the wiki and previous Biostars responses, vg autoindex --workflow giraffe only works from a VCF + reference FASTA and does not currently support working from a GFA with haplotype information.
I have considered a few options:
- Manually create all of the indexes for giraffe using the commands found here: https://github.com/vgteam/vg/wiki/Index-Types
- Use vg deconstruct to create a VCF containing all variation in the PGGB GFA graph relative to one reference, and then use that VCF + reference FASTA to run vg autoindex --workflow giraffe.
- Use an alternative method to create a VCF from the assemblies, although I am not sure which method would be best for this.
I would appreciate any advice on which of these options is best, or any general advice on best practices for constructing graphs directly from haploid assemblies.
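For concreteness, the vg deconstruct + vg autoindex option could be sketched roughly as below. This is only a sketch under assumptions, not a recommended pipeline: the file names and the reference path name are hypothetical, the helper functions are made up for illustration, and the exact flags should be checked against `vg deconstruct --help` and `vg autoindex --help` for the installed vg version. It also assumes the contig names in the reference FASTA match the reference path name used in the graph. By default the script only prints the commands.

```python
import shlex
import subprocess


def deconstruct_cmd(gfa, ref_path, threads=4):
    """Build the argv for calling variants in the graph relative to one path."""
    # -p names the path that the VCF coordinates are relative to.
    return ["vg", "deconstruct", "-p", ref_path, "-t", str(threads), gfa]


def autoindex_cmd(ref_fasta, vcf, prefix, threads=4):
    """Build the argv for building giraffe indexes from a FASTA + VCF."""
    return [
        "vg", "autoindex", "--workflow", "giraffe",
        "-r", ref_fasta, "-v", vcf, "-p", prefix, "-t", str(threads),
    ]


def run_pipeline(gfa, ref_path, ref_fasta, vcf, prefix, threads=4, dry_run=True):
    """Print (dry run) or execute both steps; return the two command lists."""
    step1 = deconstruct_cmd(gfa, ref_path, threads)
    step2 = autoindex_cmd(ref_fasta, vcf, prefix, threads)
    if dry_run:
        print(shlex.join(step1), ">", shlex.quote(vcf))
        print(shlex.join(step2))
    else:
        # vg deconstruct writes the VCF to stdout, so redirect it to a file.
        with open(vcf, "w") as out:
            subprocess.run(step1, stdout=out, check=True)
        # Assumption: some vg versions may want the VCF bgzipped and indexed.
        subprocess.run(step2, check=True)
    return [step1, step2]


if __name__ == "__main__":
    # Hypothetical inputs: a PGGB graph and one assembly chosen as reference.
    run_pipeline("pggb.gfa", "strainA#1#chr1", "strainA.fa", "pggb.vcf", "giraffe_idx")
```

Passing `dry_run=False` would actually invoke vg, which requires vg on the PATH.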
Thank you for the help! This is very useful. I am a little confused about the terminology for paths, though. Each of the reference sequences used to construct the graph is a haploid assembly from a homozygous cell line, so each reference sequence path is also a full haplotype path. Does this mean my P-lines (reference paths) should be duplicated as W-lines (haplotype paths) if I want to use the W-lines feature with GFA? Or do haplotype paths have their own distinct meaning here?
A reference path in VG terminology is a path that provides a coordinate system. If you want to use downstream tools based on linear sequences, you have to project the alignments from the graph to a reference sequence. It is often assumed that there is one reference path in each graph component.
Haplotype paths are additional paths that inform some VG algorithms which alignments are likely to be true.
We usually assume that reference paths are synthetic sequences, while haplotype paths are true haplotypes. If you have only true haplotypes, they should all be either P-lines or W-lines. With P-lines, you have to use regular expressions for parsing GBWT path names from GFA path names. With W-lines, the required fields already closely match the GBWT path name components. If you are using only one line type, GBWT does not know which set of paths is supposed to be the reference. Hence you have to specify --ref-sample when converting GBZ to XG, regardless of whether you are using P-lines or W-lines.
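To illustrate the difference between the two line types, here is a minimal Python sketch that parses a P-line name with a regular expression and rewrites the path as a W-line. It assumes PanSN-style path names of the form sample#haplotype#contig, which is common for PGGB output but is an assumption here; the function names are hypothetical, and this is not how vg itself does the conversion. It also assumes segments do not overlap, so the walk length is the sum of the segment lengths.

```python
import re

# Assumed naming convention (PanSN-style): sample#haplotype#contig.
PANSN = re.compile(r"^(?P<sample>[^#]+)#(?P<haplotype>\d+)#(?P<contig>.+)$")


def parse_path_name(name):
    """Split a GFA path name into (sample, haplotype, contig)."""
    m = PANSN.match(name)
    if m is None:
        raise ValueError(f"{name!r} does not look like sample#haplotype#contig")
    return m["sample"], int(m["haplotype"]), m["contig"]


def p_line_to_w_line(p_line, segment_lengths):
    """Rewrite one GFA P-line as a W-line covering the whole path.

    segment_lengths maps segment names (from S-lines) to sequence lengths.
    """
    fields = p_line.rstrip("\n").split("\t")
    if fields[0] != "P":
        raise ValueError("not a P-line")
    sample, haplotype, contig = parse_path_name(fields[1])
    walk = []
    length = 0
    for step in fields[2].split(","):
        segment, orientation = step[:-1], step[-1]
        if orientation not in "+-":
            raise ValueError(f"bad orientation in step {step!r}")
        # W-lines put the orientation before the segment name: > forward, < reverse.
        walk.append((">" if orientation == "+" else "<") + segment)
        length += segment_lengths[segment]
    # W-line fields: sample, haplotype index, sequence name, start, end, walk.
    return "\t".join(
        ["W", sample, str(haplotype), contig, "0", str(length), "".join(walk)]
    )


if __name__ == "__main__":
    lengths = {"1": 4, "2": 3, "3": 5}  # toy segment lengths
    print(p_line_to_w_line("P\tstrainA#1#chr1\t1+,2-,3+\t*", lengths))
```

The W-line carries sample, haplotype, and contig as separate fields, which is why no regular expression is needed on the GBWT side in that case.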