Does vg call allow fuzzy match during SV calling?
10 months ago
Maxine ▴ 50

Hi vg team,

I'm curious whether vg call can accommodate some level of fuzzy alignment, identifying a sequence that only approximately matches as a known SV within the pangenome. If this is possible, which parameters can be adjusted to set the threshold?

Considering a large SV, spanning hundreds or even thousands of bases, it's unlikely to be identical at every base if it first appeared in a population a long time ago due to genetic drift. How would vg call handle a sequence that mostly aligns to a path except for a few bases?

Maxine

10 months ago

vg itself does not currently perform this sort of fuzzy matching. If you wanted to build fuzzy matching into the analysis, it would have to happen outside vg: either by combining similar SVs in the VCF that is provided to vg construct, or alternatively in combining similar alleles from the output of vg call. It's a challenging problem, though. There's a moderately sized literature on tools to merge SVs, and as far as I can tell, none is universally better than the rest.
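To make the "combine similar alleles outside vg" idea concrete, here is a minimal Python sketch of greedy clustering by sequence similarity. It is not part of vg or of any published SV-merging tool: the function names, the 0.95 threshold, and the use of `difflib` as a stand-in for a real alignment-based identity are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Approximate similarity in [0, 1]; a stand-in for a real alignment identity."""
    # autojunk=False stops difflib from discarding frequent characters,
    # which would otherwise distort scores on long DNA strings.
    return SequenceMatcher(None, a, b, autojunk=False).ratio()


def cluster_alleles(alleles: list[str], min_similarity: float = 0.95) -> list[list[int]]:
    """Greedily group allele indices by similarity to each cluster's first member.

    min_similarity is a hypothetical threshold, not a vg parameter.
    """
    clusters: list[list[int]] = []
    for i, seq in enumerate(alleles):
        for cluster in clusters:
            representative = alleles[cluster[0]]
            if similarity(seq, representative) >= min_similarity:
                cluster.append(i)
                break
        else:
            # No existing cluster was similar enough, so start a new one.
            clusters.append([i])
    return clusters


# Hypothetical allele sequences taken from one site of a VCF.
alleles = ["ACGT" * 10, "ACGT" * 9 + "ACGA", "TTTTTTTTTT"]
clusters = cluster_alleles(alleles)
```

In this toy input the first two alleles differ by a single base and end up in the same cluster, while the third stays separate. Greedy clustering depends on input order and on the threshold, which is one reason merging SVs is hard to do in a universally satisfying way.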


You said:

or alternatively in combining similar alleles from the output of vg call

Does this imply that a variant calling pipeline that doesn't perform augmentation cannot identify new loci, but can still assign new alleles? For instance, for a bi-allelic locus (0/1) in the reference pangenome, a sequence that matches neither allele 0 nor allele 1 would be assigned allele 2. Is that what you are suggesting?


The criterion that vg call uses to assign alleles is exact sequence identity. If the graph has nested small variants within the SV, they can lead to distinct alleles for the SV in the VCF that vg call creates, regardless of whether you are using augmentation.
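As a toy illustration of what "exact sequence identity" means for allele numbering, the Python sketch below gives each distinct sequence its own index. It is not vg call's implementation; the function name and the example sequences are hypothetical.

```python
def assign_allele_indices(ref_seq: str, traversal_seqs: list[str]) -> dict[str, int]:
    """Map each distinct sequence to an allele index; index 0 is the reference."""
    index_of = {ref_seq: 0}
    for seq in traversal_seqs:
        # Exact string comparison: a single differing base yields a new allele.
        if seq not in index_of:
            index_of[seq] = len(index_of)
    return index_of


# Two insertion alleles that differ at one nested base, plus a repeat of the first.
indices = assign_allele_indices("ACGT", ["ACGTTTGA", "ACGTTCGA", "ACGTTTGA"])
```

Here the two insertion sequences differ at a single position, so they receive indices 1 and 2, while the repeated sequence reuses index 1.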


What about a sequence that is 99% identical to a certain path in the graph? Despite being so similar, there are a few base mismatches. How would vg handle this situation?


They would be reported as separate alleles.


That's great news. May I ask if there are any rules for determining this sequence? For instance, a sequence with less than 80% similarity is considered a mismatch, while one with more than 80% similarity is assigned a separate allele symbol. Perhaps the rules are complex, but if there are any documents or articles that mention this, please let me know. Thank you!


Ah, sorry, I think I misunderstood your question. I think there are two situations that we need to distinguish, and I'm not fully sure which one you expect:

  1. There is one SV allele included in the graph, but it harbors additional variants inside (which are not present in the graph). In this case, vg call will only call the reference allele or the SV allele without the nested variants, so the variant will appear to be biallelic. If you want to call the nested variants, you can use vg augment to discover small variants from the reads. If you augment, the site can be reported as a multiallelic SV, where some alleles have very similar sequences.
  2. There is an SV allele represented in the graph along with some nested variants. In this case, there is no need to augment the graph before calling the variants. You can get a multiallelic SV without modification because the very similar alleles are already present in the graph. However, for any other SVs whose nested variants are not present in the graph, the previous case still applies.

The vg call algorithm was originally published in this paper, but I don't think there's detailed documentation there. You might also be interested in this tutorial.
