I have a graph that I created from a group of haploid assemblies using PGGB. What I would like to do is use VG to align short reads from new samples to the graph using giraffe, and then genotype the samples. My end goal is to have the genotype calls be consistent with the haplotypes used to construct the graph (originally existed as W-lines in the PGGB GFA, and later as GBWT threads). I have included below the workflow that I thought would work which constructs the graph from GFA, uses vg deconstruct
to create a VCF with the genotypes of all the GBWT threads, then uses vg pack
and vg call
to genotype a .gam
based on that VCF. Unfortunately, the final command using vg call
gives me the error
[VCFTraversalFinder] Warning: No alt path (prefix=_alt_6162647cdc0e686937be7380951cff5969dced64_) found in graph for variant. It will be ignored:
for every variant. From the documentation I can see that this is caused by the fact that the graph was not constructed from a FASTA + VCF with the -a
flag, so there are no alt paths in the graph.
As solutions, I attempted to use vg call
with the -a
or -g
flags, which both worked to genotype the snarls in the graph. However, different samples produced different numbers of calls, and the POS/REF/ALT determination at each snarl was not completely consistent with the VCF created using vg deconstruct
making it difficult to harmonize the calls.
Is there any workflow that I could use to provide a VCF to vg call
to allow for harmonized calls, given that I started with a graph constructed from GFA? Such as some way to turn the GBWT threads into alt paths?
#Starting from PGGB GFA output
#Reference path stays as a P-line
#P-lines for haplotypes modified to match (sample)_(contig)_(haplotype) for conversion to W-lines
vg convert \
-g graph_plines.gfa\
-f \
-w _ \
-t 12 > graph.gfa
#Create indexes from GFA
vg autoindex \
-w giraffe \
-g graph.gfa \
-t 12 \
-V 1 \
-T /data/vg_temp
#Extract GBWT and GBWTGraph from GBZ for later user with deconstruct
vg gbwt \
-o graph.gbwt \
-g graph.gg \
-Z graph.giraffe.gbz
#Create XG from GBZ for later use with deconstruct/pack/call
vg convert \
-Z graph.giraffe.gbz \
-x > graph.xg
#Create a VCF containing genotypes of all haplotypes
vg deconstruct graph.xg \
-p PGF \
-g graph.gbwt \
-a \
-t 12 > graph.vcf
#Align short reads using giraffe
vg giraffe \
-Z graph.giraffe.gbz \
-m graph.min \
-d graph.dist \
-p \
-f sample.1.fq \
-f sample.2.fq > sample.gam
#Create packed read support, ignoring reads with MAPQ < 5 (-Q 5)
vg pack \
-x graph.xg \
-o graph.pack \
-g sample.gam \
-Q 5
#Genotype variants in graph VCF
vg call graph.xg \
-p PGF \
-k sample.pack \
-v graph.vcf \
-s sample \
-t 12 > sample.vcf
Thank you for the response. I thought I had it working, however once I looked deeper into the results I realized that
vg call -a -g
is not quite getting all of the same snarls asvg deconstruct -a -g
. Forvg deconstruct -a -g
, it seems to be creating an entry for every nested snarl just as I would expect. These are reported relative to the reference path when possible, however some nested snarls it reports relative to another GBWT thread. I guess because these are SNPs within large insertions relative to the ref path it does not report them relative to that path. However, forvg call -a -g
, it only genotypes the snarls thatvg deconstruct
reports relative to the ref path. All of the other nested snarls are not genotyped. Is this intended? It leaves me with a lot of positions that I cannot genotype if my sample has one of these insertions.