Greetings all,
I'm currently working on a project wherein I would like to use the HPRC draft pangenome as the basis to describe variation at certain classes of structural variants across a broad set of human genomes. The concept is simple: use odgi pav
to determine which samples have/lack a given sequence across a set of known SV loci in order to gain insight into the underlying biology at these sites.
Since we are only concerned with SVs, we would not benefit from nucleotide-level variation captured in the Minigraph-Cactus (MC) graphs from HPRC. We plan to build onto the HPRC graph by adding additional linear genomes, and would like to leverage the speed and efficiency of Minigraph, or rather, avoid having to run the entire MC pipeline when we don't require that level of detail. However, I've encountered a problem, which apparently stems from one key limitation of rGFA in its current form: the lack of encoded haplotypes. The consequence is that, while it is possible to reformat the rGFA graph into GFA 1.0 using vg convert -Wf -r 89
(the -r 89 argument inducing "P" lines for all samples in the final GFA), and subsequently convert this file with odgi build
for processing with odgi pav
, all genomic ranges queried show sequence only in the GRCh38 reference, whereas the same loci show sequence in all samples within the MC graphs. This result is clearly not correct, but might make sense given limitations of the (current) rGFA spec. However, this is not a particularly satisfactory explanation and raises more questions than it answers for me.
First off, I don't quite grasp the reasons why rGFA cannot encode haplotype information. Is this a bug or a feature? Is it possible to somehow "inject" the haplotype information as part of the conversion process, or are there reasons why the rGFA format cannot capture such information? It seems to me that haplotypes can easily be encoded in the stable sequence names, which HPRC seems to do by formatting contig names in SN tags as SAMPLE#HAPLOTYPE#CONTIG. Why is it that these tags are insufficient to reconstruct haplotypes? I'm clearly missing something!
Second, if there is no way to directly-manipulate the rGFA file to capture the necessary haplotypes in the final GFA, would it be possible, without running the full MC pipeline, to process the Minigraph graphs into MC graphs, but without incorporating any additional variants (e.g., by skipping the Cactus alignment step). Would it be possible, for example, to decompose the Minigraph graph into a set of pairwise alignments, which then could be converted into vg/odgi format using the final stages of the MC pipeline? Would it help to align the original assemblies back to the graph as is done in the first stage of the MC pipeline?
Finally, to (hopefully) help my understanding of why I'm seeing the odgi pav
results I am with the Minigraph graphs, I am curious as to what the vg convert -Wf -r 89
command is actually producing. The resulting GFA file has "P" lines for all samples, but apparently these are not equivalent to "P" lines produced when converting the MC graphs into GFA1.0 format. What exactly is "missing" from these path definitions that renders then uninformative as to alignments against GRCh38 within the graph, or are the differences elsewhere in the file?
If it's not technically feasible to tweak the rGFA graph to include haplotypes, I am prepared to run my analysis with the full MC pipeline, but would definitely prefer a solution that uses the Minigraph graphs directly, even if it means taking additional steps to coerce it into a useful form. Given the huge difference in computational requirements between Minigraph and MC graph building, I would think such a workflow would be of great general use, particularly for those studying SVs.
That's all for now. Hoping someone can chime in with some answers!
Firstly, this blog post might help you understand a bit more on this very complex topic. https://ekg.github.io/2019/07/09/Untangling-graphical-pangenomics
Second - thanks for the documentation of the ability to convert between rGFA and GFA1.0. I did not know this is possible - but as you say, I'm not sure it is technically complete or useful (I haven't tried it yet).
Third - if you add the tags vg and vgteam you may get an answer from the VG and/or cactus team.
My personal feeling is you'll have to re-run minigraph-cactus to get to where you want. Keep in mind - and see some comparison papers - that graphs generated by Minigraph are very different to those generated by Minigraph-Cactus (and also PGGB).
Lastly - I had similar impressions on this whole conglomerate of tools. I think I needed to re-align genomes to the minigraph rGFA graph with minigraph to obtain some form of haplotype information, which was .... suboptimal and confusing - with output in the GAF read mapping format. The VG and ODGI PAV routes were far more convincing and understandable for us.
I also maintain this list which may be helpful : https://github.com/colindaven/awesome-pangenomes
Thank you for the reply! The link to Erik Garrison's blog post on the subject looks like it will be very helpful to address gaps in my technical knowledge and I look forward to reading it in detail.
As of right now, I've as much as made the decision to proceed with the full MC pipeline on our data. While this will mean extra time and computation, on balance, it probably will go more quickly and smoothly than trying to identify the minimum set of operations to transform (not sure convert is the right word!) rGFA to an MC-equivalent GFA1.0, if that even turns out to be possible.
Upon reflection, I think the key would be mapping assemblies back to the Minigraph graph and somehow coercing the results back into a graph format. As you say, though, this is likely to be suboptimal and confusing.
Thank you for maintaining the list of pangenome tools. I came across it in my browsing here before posting the question and look forward to reading up on them. Sadly, I don't see graph equivalents of bedtools and samtools under development yet, which I find somewhat surprising!