Using VG to genotype a GAM using a graph constructed from GFA
1
2
Entering edit mode
3.2 years ago
Michael ▴ 30

I have a graph that I created from a group of haploid assemblies using PGGB. What I would like to do is use VG to align short reads from new samples to the graph using giraffe, and then genotype the samples. My end goal is to have the genotype calls be consistent with the haplotypes used to construct the graph (originally existed as W-lines in the PGGB GFA, and later as GBWT threads). I have included below the workflow that I thought would work which constructs the graph from GFA, uses vg deconstruct to create a VCF with the genotypes of all the GBWT threads, then uses vg pack and vg call to genotype a .gam based on that VCF. Unfortunately, the final command using vg call gives me the error

[VCFTraversalFinder] Warning: No alt path (prefix=_alt_6162647cdc0e686937be7380951cff5969dced64_) found in graph for variant.  It will be ignored:

for every variant. From the documentation I can see that this is caused by the fact that the graph was not constructed from a FASTA + VCF with the -a flag, so there are no alt paths in the graph.

As solutions, I attempted to use vg call with the -a or -g flags, which both worked to genotype the snarls in the graph. However, different samples produced different numbers of calls, and the POS/REF/ALT determination at each snarl was not completely consistent with the VCF created using vg deconstruct making it difficult to harmonize the calls.

Is there any workflow that I could use to provide a VCF to vg call to allow for harmonized calls, given that I started with a graph constructed from GFA? Such as some way to turn the GBWT threads into alt paths?

#Starting from PGGB GFA output
#Reference path stays as a P-line
#P-lines for haplotypes modified to match (sample)_(contig)_(haplotype) for conversion to W-lines
vg convert \
    -g graph_plines.gfa\
    -f \
    -w _ \
    -t 12 > graph.gfa

#Create indexes from GFA
vg autoindex \
    -w giraffe \
    -g graph.gfa \
    -t 12 \
    -V 1 \
    -T /data/vg_temp

#Extract GBWT and GBWTGraph from GBZ for later user with deconstruct
vg gbwt \
    -o graph.gbwt \
    -g graph.gg \
    -Z graph.giraffe.gbz

#Create XG from GBZ for later use with deconstruct/pack/call
vg convert \
    -Z graph.giraffe.gbz \
    -x > graph.xg

#Create a VCF containing genotypes of all haplotypes
vg deconstruct graph.xg \
    -p PGF \
    -g graph.gbwt \
    -a \
    -t 12 > graph.vcf

#Align short reads using giraffe
vg giraffe \
    -Z graph.giraffe.gbz \
    -m graph.min \
    -d graph.dist \
    -p \
    -f sample.1.fq \
    -f sample.2.fq > sample.gam

#Create packed read support, ignoring reads with MAPQ < 5 (-Q 5)
vg pack \
    -x graph.xg \
    -o graph.pack \
    -g sample.gam \
    -Q 5

#Genotype variants in graph VCF
vg call graph.xg \
    -p PGF \
    -k sample.pack \
    -v graph.vcf \
    -s sample \
    -t 12 > sample.vcf
vg vgteam • 2.7k views
ADD COMMENT
1
Entering edit mode
3.2 years ago
glenn.hickey ▴ 520

You are correct that vg call -v only works with graphs generated with vg construct -a.

But vg call -a -g should do what you want here. -a should guarantee that each invocation of vg call on the same graph will genotype the same sites. And -g <gbwt> will restrict possible alleles to those in the GBWT index (which is what deconstruct uses).

ADD COMMENT
1
Entering edit mode

Thank you for the response. I thought I had it working, however once I looked deeper into the results I realized that vg call -a -g is not quite getting all of the same snarls as vg deconstruct -a -g. For vg deconstruct -a -g, it seems to be creating an entry for every nested snarl just as I would expect. These are reported relative to the reference path when possible, however some nested snarls it reports relative to another GBWT thread. I guess because these are SNPs within large insertions relative to the ref path it does not report them relative to that path. However, for vg call -a -g, it only genotypes the snarls that vg deconstruct reports relative to the ref path. All of the other nested snarls are not genotyped. Is this intended? It leaves me with a lot of positions that I cannot genotype if my sample has one of these insertions.

ADD REPLY

Login before adding your answer.

Traffic: 1642 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6