Hello all,
We are reaching out since we have some practical questions regarding variant calling and analyses of short-read data mapped to a pangenome graph. We are working on a project aimed to describe the diversity of a fungal plant pathogen (genome size 50 Mb, haploid) using a pangenome reference. We have successfully executed the initial steps of our workflow, but we've encountered a hurdle in the latter stages of variant calling and genotyping.
Project Workflow: Our project's workflow is structured as follows:
Pangenome Graph Construction: We employed PGGB (Pan-Genome Graph Builder) to create a comprehensive pangenome graph. This graph incorporates different long-read based genome assemblies (chromosome-level) of different strains to capture the extensive variation that exists within this species.
Short-Read Alignment: We used vg giraffe to align our short-read data from several genetically distinct strains to the PGGB pangenome graph successfully.
The Challenge:
The challenge arises as we aim to analyse to short-read data mapped to the graph. We basically aim to achieve two interconnected objectives:
- We aim to genotype the variants present in the graph using the aligned short-read data. We are trying to use variant calling tools such as GATK and FreeBayes. However, we are not sure what would be the best way to move forward. First, we could extract a vcf file that contains the variation in the graph (vg deconstruct), but we are unsure how we can cross-reference these with the output of vg giraffe, i.e.g, the short read mapping would need to be integrating to genotype the variants per sample. Second, we could use the bam file of vg giraffe to generate variant calls using GATK/FreeBayes but we are not sure how these then would correspond to the variation present in the graph. We think the first option seems to be the right way forward but we are not sure how to practically approach this as the documentation in this regard is, at least for us, very cryptic.
- We would like to use the read mapping of the short-reads to the pan-genome to give us an impression which nodes in the pan-genome graph are conserved by all other genotypes, i.e., determine the mapping depth and breadth per node in the graph for each sample individually. We did attempt to do this using gaf file from vg giraffe as it report the mapping per node. However, we have a hard time to understand how the mapping positions in a node correspond to the mapping on the path (as seen by the bam file). Specifically, we see many very short mappings on the nodes that do not seem to correspond to the single mapping in the bam file. There might be other ways to obtain the information we need, but we would like to understand how the mapping locations relate to each other. Is the a better way to achieve what we aim to get out of the graph and the mapping? Ideally, we would like to get per node in the graph, how many nucleotides (and which nucleotides) are covered by how many short reads.
We would greatly appreciate insights, advice, or recommended strategies to address the challenges we are facing with variant calling and genotyping within the context of a pangenome.
Cheers,
Kyran & Michael
There's a lot to respond to here, but I'll do my best.
Re: Genotyping If your goal is just to call small variants, you can use
vg giraffe
's BAM output directly for that. The VG developers have collaborated with DeepVariant developers, and that is probably the most optimized such workflow, but GATK and FreeBayes should work as well. You will have no guarantee that the variants they call correspond in a simple way to the graph content. This is partly because VCF is not a good format for complex variation, so the output (variants) are less expressive than the input (graph). Another part is that VCF normalization is actually a pretty hard problem.bcftools norm
will help a bit, but it doesn't do much for complex variation. If 1) your graph is constructed originally from VCF input and 2) you want to call variants withvg call
, then you can use thevg construct -a
andvg call -v
options to have the output VCF match the input VCF. Unfortunately, that doesn't apply to your situation, since you are constructing the graph from assembly input and you are interested in different callers.Re: Node coverages.
If you want some more documentation on GAF, the specification is here. The node mappings may not correspond to any position on the paths you report the BAM against because not all paths contain all nodes. In addition, VG typically imposes a maximum length for nodes (for technical computational reasons). Because of this, it's common for an alignment to span several nodes, each of which has only a short subsection of the full alignment.
Fortunately, there is a tool in VG that can do what you're asking with node coverages. Take a look at
vg pack -d -a
. Some of the other options invg pack
might also be relevant to you.