For a graph built with Minigraph-Cactus, is there a way to determine which portions of a non-reference path cannot be surjected to the primary reference coordinate system?
As a theoretical example, if I were to take short reads (150-bp, paired, with zero errors) generated from one of the assemblies included in the graph as a non-reference path, align them to the graph with vg giraffe and output a BAM file in the coordinate system of the primary reference, some sites will inevitably be missing (e.g., because the non-reference path includes a sufficiently large alternate allele at a locus). Is there a way to identify such non-surjectable sites in both the primary reference and non-reference paths with respect to one another, just from the graph, rather than proceeding through alignment and variant calling?
Thanks! I have used `odgi pav`, but didn't think of it as a potential solution here. If I were to use that, I think what I would be looking for is sequence missing from the primary reference relative to each path, and I haven't had much luck setting a non-reference path as the reference for that purpose. I'm also interested in a genome-wide estimate of these types of variants, and I'm not sure this approach would be efficient for that. (I am working with pretty large and often messy plant genomes.)

Maybe what I'm actually looking for are nodes that are not traversed by the primary reference and that are long enough that surjection for the purpose of calling SNPs would not work. When I call SNPs/MNVs/indels from graph-aligned reads, I surject to the primary reference and keep called variants <50 bp, because in theory variants >=50 bp are caught by `vg call` in a separate process. So nodes >=50 bp in length that are fully covered by reads (i.e., the entire variant is present in the reseq sample) would produce non-surjectable alignments? Am I thinking about that correctly?
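
To make the node-based idea concrete, here is a rough sketch of how I imagine listing such nodes directly from a GFA export of the graph. It is only an approximation of the heuristic described above, not a statement of how `vg surject` actually decides what to drop. Assumptions: segment sequences are stored inline on `S` lines, paths appear as GFA 1.1 `W` lines (or `P` lines whose name starts with the reference sample name), and the function name `off_reference_nodes` and the 50 bp threshold are my own choices for illustration.

```python
import re

def off_reference_nodes(gfa_lines, ref_sample, min_len=50):
    """Return {node_id: length} for nodes of at least min_len bp
    that no path of ref_sample traverses."""
    node_len = {}
    ref_nodes = set()
    for line in gfa_lines:
        fields = line.rstrip("\n").split("\t")
        if fields[0] == "S":
            # S <id> <sequence>; assumes the sequence is inline, not "*"
            node_len[fields[1]] = len(fields[2])
        elif fields[0] == "P" and fields[1].startswith(ref_sample):
            # P-line steps look like "1+,2-,3+"
            ref_nodes.update(step[:-1] for step in fields[2].split(","))
        elif fields[0] == "W" and fields[1] == ref_sample:
            # W <sample> <hap> <seq> <start> <end> <walk>, walk like ">1<2>3"
            ref_nodes.update(re.findall(r"[<>]([^<>]+)", fields[6]))
    return {
        node: length
        for node, length in node_len.items()
        if length >= min_len and node not in ref_nodes
    }
```

Usage would be something like `off_reference_nodes(open("graph.gfa"), "REF")`, where both the file name and the sample name are placeholders. Summing the returned lengths would give a crude genome-wide estimate of off-reference sequence at or above the threshold, with the caveat that a large allele split across several short nodes would be missed by a per-node cutoff.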