Question

SV calling using giraffe/vg

0

Entering edit mode

23 months ago

TN • 0

It's not very clear to me how exactly VG or Giraffe carries out SV calling (if it truly does). The tutorials mention how we can do genotyping using existing SV calls, but I was wondering if VG or Giraffe can actually identify novel SVs in genomes that were not used to construct the VG graph.

vg Giraffe calling structural variant • 3.3k views

ADD COMMENT • link updated 17 months ago by Jordan M Eizenga ▴ 660 • written 23 months ago by TN • 0

2

Entering edit mode

Technically, VG can identify some novel SVs, but its strength is in genotyping known SVs. To detect novel variants, VG relies on them being present in the alignments. Short read alignments usually can find novel small variants, but novel SVs tend to cause alignment and mapping problems. The maximum size SV it can detect is roughly the read length, but the probability of detection drops off pretty sharply with SV size.

ADD REPLY • link 23 months ago by Jordan M Eizenga ▴ 660

0

Entering edit mode

Thank you very much for the response!

The strength of Giraffe is indeed lies in fast pangenome mapping, which can improve downstream SV calling and genotyping compared to using a single linear reference genome. e.g., SV calling can be better for insertions where PE reads cannot be mapped to a linear reference genome, and genotyping can be better as now we have information about multiple variants (collected from multiple samples) within a pangenome graph.

A (sub-)figure in the "Research Article Summary" page for the Siren et al. (2021) Science paper is titled "Call variants and genotype" and I was wondering how exactly variant calling was performed for SVs for this paper. I couldn't exactly locate that information and based on my understanding the work focused on genotyping of known SVs collected from other sources like TOPMed (and not on SV calling per se) as emphasized by the term "known" in the title "Pangenomics enables genotyping of known structural variants in 5202 diverse genomes". Please correct me if my interpretation is wrong.

Can I please know what SV detection technique (e.g. discordant PE read mapping, split read mapping, read depth) is VG/Giraffe using for SV calling (not the mapper, but the "call" function in VG/Giraffe), and if the calling precision/recall/time is comparable to that of standard SV callers like Manta and Lumpy when a standard human reference genome is used (not a pangenome reference)? I'm not sure of that makes sense or not, i.e. using a linear reference with VG/Giraffe. If not, an equivalent limiting case would be using the same sequence while traversing different paths within the pangenome graph (by using the same sequence as alternate paths at any locus).

I entirely agree with the deficiencies you pointed for the novel SV detection process using short-reads, and I think using a pangenome reference can reduce the reference allele bias (even though other shortcomings related to the limited insert size will still remain).

Also, my understanding is that, to improve SV calling on new genomes (from short-read sequencing) using pangenome reference, and to detect novel SVs, I should be carrying out the following steps:

Create a pangenome graph (using VG/Giraffe map) from available genome sequences for which good quality SV calls have already been made using other callers, or grab the graphs you provided (generated from SV catalogs from long read data).
Map new genomes (for which I want to detect SVs) to the above pangenome graph using VG/Giraffe to get gam files.
Convert the above gam files to linear bam files
a) Continue with the usual SV calling steps with traditional callers using the above bam file, but I'm not sure what linear reference file to use, OR b) Create an augmented pangenome graph using the new sequences and use "vg call" to call the novel variants.
To improve genotyping, I can create a new graph using the SV catalog generated in step (4) and use "vg call".

Please correct any inaccurate statements above (particularly step 4).

Thank you very much!

ADD REPLY • link 23 months ago by TN • 0

1

Entering edit mode

I think the distinction between "calling" and "genotyping" in that sentence is that calling a variant involves identifying the presence of a non-reference variant, whereas genotyping also involves calling it as heterozygous or homozygous.

Among the methods you listed, the SV calling method in VG is most similar to the split-read method. However, the alignments typically aren't really "split" in a pangenome graph, they just align to a non-reference walk. For more details, you could check out our SV calling paper: https://link.springer.com/article/10.1186/s13059-020-1941-7

The pipeline you're suggesting differs from ours at steps 4) and 5). You can look at @Medhat's reply above to see how to call variants in VG without converting to a BAM. That said, the alignments you get from a VG-generated BAM will often align more accurately around SVs than a BAM from a more traditional aligner, which could make it easier to detect SVs with a linear reference. I'm not sure if anyone has benchmarked that type of pipeline.

ADD REPLY • link 23 months ago by Jordan M Eizenga ▴ 660

0

Entering edit mode

Thank you very much for the detailed response. Is there a way I can integrate a standard SV calling pipeline (e.g. using a conventional SV caller like Lumpy, that I use with bam files mapped to linear references) using this pangenome graph, or I must use "vg call" for variant calling?

ADD REPLY • link 23 months ago by TN • 0

1

Entering edit mode

According to the documentation, there are two ways to call SVs: 1- Call only variants that are present in the graph.

# Compute the read support from the gam
# -Q 5: ignore mapping and base qualitiy < 5
# -s 5: ignore first and last 5bp from each read
vg pack -x x.xg -g aln.gam -Q 5 -s 5 -o aln.pack

# Generate a VCF from the support.  
vg call x.xg -k aln.pack > graph_calls.vcf

2- And consider novel variants from the reads, use the augmented graph and gam using vg augment -A:

# Index our augmented graph
vg index aug.vg -x aug.xg

# Compute the read support from the augmented gam (ignoring qualitiy < 5, and 1st and last 5bp of each read)
vg pack -x aug.xg -g aug.gam -Q 5 -s 5 -o aln_aug.pack

# Generate a VCF from the support
vg call aug.xg -k aln_aug.pack > calls.vcf

Source

ADD REPLY • link 23 months ago by Medhat 9.8k

0

Entering edit mode

Thank you very much! I have already gone through these instructions in the documentation but do not have much idea about the SV calling process with "vg call". I'd appreciate it if you can elaborate on the SV detection technique (e.g. discordant PE read mapping, split read mapping, read depth) being used.

ADD REPLY • link 23 months ago by TN • 0

0

Entering edit mode

I have doubts about the vg augment -A mentioned in Point 2. Based on my understanding, this command should be executed as follows:

vg augment -i -S x.vg x.gam -A x.aug.gam > x.aug.vg

Running vg autoindex -giraffe does not generate a VG-formatted pangenome. Should I use vg construct to build one as input for vg augment?
When I run vg augment -i -S, it crashed. However, it ran successfully when I remove these two options. Are these two parameters essential for the subsequent vg call?

ADD REPLY • link 17 months ago by Maxine ▴ 50

1

Entering edit mode

To make sure you have a VG graph that matches your alignments, you should directly convert the graph you mapped to like this:

vg convert graph.gbz > graph.vg

Also, the -i and -S flags aren't standard best practice. Most users will want to only use -m 2 -q 10.

ADD REPLY • link 17 months ago by Jordan M Eizenga ▴ 660