Entering edit mode
23 months ago
TN
•
0
It's not very clear to me how exactly VG or Giraffe carries out SV calling (if it truly does). The tutorials mention how we can do genotyping using existing SV calls, but I was wondering if VG or Giraffe can actually identify novel SVs in genomes that were not used to construct the VG graph.
Technically, VG can identify some novel SVs, but its strength is in genotyping known SVs. To detect novel variants, VG relies on them being present in the alignments. Short read alignments usually can find novel small variants, but novel SVs tend to cause alignment and mapping problems. The maximum size SV it can detect is roughly the read length, but the probability of detection drops off pretty sharply with SV size.
Thank you very much for the response!
The strength of Giraffe is indeed lies in fast pangenome mapping, which can improve downstream SV calling and genotyping compared to using a single linear reference genome. e.g., SV calling can be better for insertions where PE reads cannot be mapped to a linear reference genome, and genotyping can be better as now we have information about multiple variants (collected from multiple samples) within a pangenome graph.
A (sub-)figure in the "Research Article Summary" page for the Siren et al. (2021) Science paper is titled "Call variants and genotype" and I was wondering how exactly variant calling was performed for SVs for this paper. I couldn't exactly locate that information and based on my understanding the work focused on genotyping of known SVs collected from other sources like TOPMed (and not on SV calling per se) as emphasized by the term "known" in the title "Pangenomics enables genotyping of known structural variants in 5202 diverse genomes". Please correct me if my interpretation is wrong.
Can I please know what SV detection technique (e.g. discordant PE read mapping, split read mapping, read depth) is VG/Giraffe using for SV calling (not the mapper, but the "call" function in VG/Giraffe), and if the calling precision/recall/time is comparable to that of standard SV callers like Manta and Lumpy when a standard human reference genome is used (not a pangenome reference)? I'm not sure of that makes sense or not, i.e. using a linear reference with VG/Giraffe. If not, an equivalent limiting case would be using the same sequence while traversing different paths within the pangenome graph (by using the same sequence as alternate paths at any locus).
I entirely agree with the deficiencies you pointed for the novel SV detection process using short-reads, and I think using a pangenome reference can reduce the reference allele bias (even though other shortcomings related to the limited insert size will still remain).
Also, my understanding is that, to improve SV calling on new genomes (from short-read sequencing) using pangenome reference, and to detect novel SVs, I should be carrying out the following steps:
Please correct any inaccurate statements above (particularly step 4).
Thank you very much!
I think the distinction between "calling" and "genotyping" in that sentence is that calling a variant involves identifying the presence of a non-reference variant, whereas genotyping also involves calling it as heterozygous or homozygous.
Among the methods you listed, the SV calling method in VG is most similar to the split-read method. However, the alignments typically aren't really "split" in a pangenome graph, they just align to a non-reference walk. For more details, you could check out our SV calling paper: https://link.springer.com/article/10.1186/s13059-020-1941-7
The pipeline you're suggesting differs from ours at steps 4) and 5). You can look at @Medhat's reply above to see how to call variants in VG without converting to a BAM. That said, the alignments you get from a VG-generated BAM will often align more accurately around SVs than a BAM from a more traditional aligner, which could make it easier to detect SVs with a linear reference. I'm not sure if anyone has benchmarked that type of pipeline.
Thank you very much for the detailed response. Is there a way I can integrate a standard SV calling pipeline (e.g. using a conventional SV caller like Lumpy, that I use with bam files mapped to linear references) using this pangenome graph, or I must use "vg call" for variant calling?
According to the documentation, there are two ways to call SVs: 1- Call only variants that are present in the graph.
2- And consider novel variants from the reads, use the augmented graph and gam using
vg augment -A
:Source
Thank you very much! I have already gone through these instructions in the documentation but do not have much idea about the SV calling process with "vg call". I'd appreciate it if you can elaborate on the SV detection technique (e.g. discordant PE read mapping, split read mapping, read depth) being used.
I have doubts about the
vg augment -A
mentioned in Point 2. Based on my understanding, this command should be executed as follows:Running
vg autoindex -giraffe
does not generate a VG-formatted pangenome. Should I usevg construct
to build one as input forvg augment
?When I run
vg augment -i -S
, it crashed. However, it ran successfully when I remove these two options. Are these two parameters essential for the subsequentvg call
?To make sure you have a VG graph that matches your alignments, you should directly convert the graph you mapped to like this:
Also, the
-i
and-S
flags aren't standard best practice. Most users will want to only use-m 2 -q 10
.