Question

Parameter Settings for Haplotype-Aware Aligner

0

19 days ago

chiyong4783 • 0

I have constructed a graphical pan-genome using seven genomes, and I plan to use it for haplotype sampling. How should I set the k-node subpaths? The default value is 4.

vg • 493 views

updated 17 days ago by GenoMax 148k • written 19 days ago by chiyong4783 • 0

score 3 · Accepted Answer · 2024-12-03

18 days ago

Jouni Sirén ▴ 510

There are two different methods used with Giraffe that have been called haplotype sampling:

Proportional sampling of local haplotypes. This is mostly used with VCF-based graphs with many haplotypes to get rid of rare variants. By default, vg autoindex uses it if the graph contains more than 192 haplotypes.
Creation of a personalized pangenome reference based on k-mer counts in the reads.

The k-node subpaths are used in proportional sampling, and nobody has really tried values other than the default k = 4. But because your graph is based on only 7 genomes, you should use their haplotypes directly instead of downsampling them to a smaller number of synthetic haplotypes.

You can find documentation for creating personalized references in the vg wiki.

0

Thank you for your response. I've successfully built a pan-genome consisting of 7 fugu genomes using MC. Now I'm planning to find more SVs by adding short-read sequence data using vg. Should I perform haplotype sampling or allele frequency filtering? The VG Giraffe best practices document says: "With a small number of haplotypes (e.g. 10), the default graph is usually a good choice." For such a simple graph, is the default already sufficient (that is, with neither haplotype sampling nor allele frequency filtering applied)? --clip

18 days ago by chiyong4783 • 0

1

Sampling and frequency filtering both require a larger number of haplotypes to do anything meaningful.

The basic idea behind sampling and filtering is that if a variant is present both in the reference graph and the sequenced genome, its presence in the graph will make read mapping more accurate. But if a variant is present in the reference but not in the sequenced genome, it can be misleading and lower the accuracy. In general, the former effect is greater than the latter.

Proportional sampling and frequency filtering both aim to create a single universal reference that includes common variants, excludes rare variants, and may include or exclude variants that are moderately common. But if you don't have enough haplotypes, you can't tell the difference between common and rare variants reliably enough, and these approaches will likely fail to improve the accuracy.

In contrast, a personalized reference tries to include variants that are present in the sequenced genome and exclude variants that are not present. But because this is done at the level of 10 kbp blocks rather than individual variants, you need enough haplotypes to have a reasonable chance of finding good enough haplotypes for most blocks.

If you have only a small number of genomes in the graph, none of these techniques is likely to improve the accuracy.
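
As a rough illustration of why a handful of haplotypes cannot separate common variants from rare ones, here is a small binomial calculation. It is not vg code. The function names and the 1% and 20% population frequencies are made up for the example, haplotypes are treated as independent draws, and the 7 genomes are assumed to contribute one haplotype each.

```python
from math import comb


def min_observable_frequency(n_haplotypes: int) -> float:
    """Smallest nonzero allele frequency n haplotypes can show (a single carrier)."""
    return 1.0 / n_haplotypes


def prob_carriers(n_haplotypes: int, true_freq: float, carriers: int) -> float:
    """Binomial probability that exactly `carriers` haplotypes carry the variant."""
    return (
        comb(n_haplotypes, carriers)
        * true_freq ** carriers
        * (1.0 - true_freq) ** (n_haplotypes - carriers)
    )


# Hypothetical population frequencies for a "rare" and a "common" variant.
RARE, COMMON = 0.01, 0.20

# 7 haplotypes as assumed for the question; 192 is the autoindex threshold
# mentioned in the accepted answer.
for n in (7, 192):
    print(f"{n} haplotypes: smallest observable frequency = "
          f"{min_observable_frequency(n):.4f}")
    for label, freq in (("rare", RARE), ("common", COMMON)):
        p_single = prob_carriers(n, freq, 1)
        print(f"  {label} variant (true freq {freq:.2f}) "
              f"carried by exactly one haplotype: {p_single:.3g}")
```

With 7 haplotypes, every variant in the graph has an observed frequency of at least 1/7 (about 14%), so a variant seen on a single haplotype could equally well be rare or common in the population. With 192 haplotypes, a variant at 20% population frequency is very unlikely to show up on just one haplotype, so the observed counts start to carry real information.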

18 days ago by Jouni Sirén ▴ 510

0

Thank you so much for your response! It gave me a much better understanding. You're awesome!