Hello VG Team,
I have been wondering whether there is a more efficient way to perform population-level structural variant (SV) calling with the VG calling pipeline.
Based on my understanding, the VG calling pipeline consists of the following steps: 1) `vg construct` to create the graph.vg; 2) `vg index` to generate the xg and gcsa indexes; 3) `vg map` to produce the mapping file (gam); 4) `vg augment` to create the augmented graph; 5) indexing and mapping against the augmented graph; 6) finally, running `vg call`.
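For reference, the six steps above can be sketched roughly as the shell plan below. It only prints the commands for one sample (pipe the output into `sh` to actually run them). The function name `plan_map_pipeline` and all file names are placeholders, and the sketch assumes `vg augment -A` to carry the alignments over to the augmented graph instead of re-mapping in step 5, plus a `vg pack` step before `vg call`. Flags can differ between vg releases, so check `vg <subcommand> --help` first.

```shell
#!/bin/sh
# Dry-run sketch: prints the commands for one sample instead of running them.
# File names (ref.fa, variants.vcf.gz, <sample>_1.fq.gz, ...) are placeholders.
plan_map_pipeline() {
  s=$1   # sample name
  cat <<EOF
# 1) build the graph from a linear reference and a VCF
vg construct -r ref.fa -v variants.vcf.gz > graph.vg
# 2) build the xg and GCSA indexes
vg index -x graph.xg graph.vg
vg index -g graph.gcsa -k 16 graph.vg
# 3) map paired-end reads to the graph
vg map -x graph.xg -g graph.gcsa -f ${s}_1.fq.gz -f ${s}_2.fq.gz > ${s}.gam
# 4) augment the graph with variation supported by the reads
vg augment graph.vg ${s}.gam -A ${s}.aug.gam > ${s}.aug.vg
# 5) index the augmented graph (alignments were carried over by -A above)
vg index -x ${s}.aug.xg ${s}.aug.vg
# 6) compute read support, then call variants
vg pack -x ${s}.aug.xg -g ${s}.aug.gam -Q 5 -o ${s}.aug.pack
vg call ${s}.aug.xg -k ${s}.aug.pack -s ${s} > ${s}.vcf
EOF
}

plan_map_pipeline sampleA
```

Printing the plan first makes it easy to review the exact commands, or to hand them to a job scheduler one sample at a time.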
The issue is that this pipeline takes at least 7.5 days for just one sample. As I need to process multiple samples, I am considering randomly selecting one sample from each group to run the full pipeline, and then using the combined VCF from `vg call` to create a new graph.vg. The remaining samples can then use this newly generated graph and only execute steps 1), 2), 3), and 6).
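A rough sketch of that plan, again as a dry-run that only prints commands: merge the per-representative VCFs, rebuild the graph once, then genotype each remaining sample against it. The function names and file names are hypothetical. Merging with `bcftools merge`, keeping alt paths with `vg construct -a` and `vg index -L`, and genotyping known variants with `vg call -v` are assumptions about how the plan could be implemented, not a confirmed recipe.

```shell
#!/bin/sh
# Dry-run sketch: prints commands only. File names are placeholders.

# Arguments: uncompressed VCFs from vg call, one per representative sample.
# Assumes each VCF has a distinct sample name (e.g. set with vg call -s).
plan_merged_graph() {
  gz=""
  for v in "$@"; do
    echo "bgzip -f $v && tabix -p vcf $v.gz"
    gz="$gz $v.gz"
  done
  cat <<EOF
bcftools merge -m none -Oz -o merged.vcf.gz$gz
tabix -p vcf merged.vcf.gz
vg construct -r ref.fa -v merged.vcf.gz -a > merged.vg
vg index -x merged.xg -L merged.vg
vg index -g merged.gcsa -k 16 merged.vg
EOF
}

# Argument: name of one remaining sample (no augmentation step).
plan_genotype_sample() {
  s=$1
  cat <<EOF
vg map -x merged.xg -g merged.gcsa -f ${s}_1.fq.gz -f ${s}_2.fq.gz > ${s}.gam
vg pack -x merged.xg -g ${s}.gam -Q 5 -o ${s}.pack
vg call merged.xg -k ${s}.pack -v merged.vcf.gz -s ${s} > ${s}.genotypes.vcf
EOF
}

plan_merged_graph repA.vcf repB.vcf
plan_genotype_sample sampleC
```

The graph and its indexes are built once and shared, so the per-sample cost is reduced to mapping, packing, and calling.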
I would appreciate your input on whether this plan is appropriate. Are there any other methods that I may not be aware of that can efficiently solve this problem?
Additionally, I am curious about the quality of the new SVs generated in step 3). If the quality is not satisfactory, I am thinking of skipping the `vg augment` step for all the samples.
Thank you for your assistance.
Best regards, Maxine
These days most users are opting for `vg giraffe` instead of `vg map` for short-read mapping. Its speed is closer to what people expect from a tool like `bwa mem`. You can also get away without augmenting the graph if your primary interest is structural variants. For small variants, you can get better performance by projecting the graph mappings to a linear reference using `vg surject` and then using DeepVariant (you can see this analysis in the main HPRC paper if you want a model).
I am highly interested in using giraffe, but I encountered two challenges: the VCF I used to construct the graph is unphased, and more importantly, the `vg autoindex --workflow giraffe` process consistently fails due to out-of-memory issues. Therefore, I have the following questions:
I don't have enough experience with `vg giraffe` on unphased inputs to give a firm yes or no. My guess is that it depends on the variant density. If your graph frequently has several variants within the span of a read length, the lack of phasing will probably hurt. If not, it might be okay.
Neither `vg giraffe` nor `vg autoindex` is particularly light on memory use, but I would be surprised if `vg autoindex` used much more memory than `vg giraffe`. If you're interested in pursuing a manual pipeline, you can make a graph with `vg construct -a` and then index it as a GBWT with `vg gbwt`. There are some suggestions for how to do that in this guide.
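As a hedged illustration of that manual route, the dry-run sketch below prints one possible command sequence: construct with alt paths, build the GBWT, derive the Giraffe indexes, then map. The function names and file names are placeholders, and the GBZ, distance, and minimizer steps are assumptions based on recent vg releases. Note that the help text of `vg gbwt -v` describes its input as phased VCF files, so how it treats unphased genotypes should be checked before relying on this.

```shell
#!/bin/sh
# Dry-run sketch: prints commands only. File names are placeholders, and
# flags may differ between vg releases (check each subcommand's --help).
plan_giraffe_indexes() {
  cat <<EOF
# graph with alt paths (-a), which GBWT construction from a VCF relies on
vg construct -r ref.fa -v variants.vcf.gz -a > graph.vg
vg index -x graph.xg -L graph.vg
# haplotype index from the VCF
vg gbwt -x graph.xg -o graph.gbwt -v variants.vcf.gz
# GBZ, distance index, and minimizer index used by Giraffe
vg gbwt -x graph.xg -g graph.gbz --gbz-format graph.gbwt
vg index -j graph.dist graph.gbz
vg minimizer -d graph.dist -o graph.min graph.gbz
EOF
}

# Argument: sample name; maps paired-end reads with Giraffe.
plan_giraffe_map() {
  s=$1
  echo "vg giraffe -Z graph.gbz -m graph.min -d graph.dist -f ${s}_1.fq.gz -f ${s}_2.fq.gz > ${s}.gam"
}

plan_giraffe_indexes
plan_giraffe_map sampleA
```

Running the index steps one at a time also shows which step hits the memory limit, which is harder to see when `vg autoindex` runs the whole workflow.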