I have constructed a graphical pan-genome using seven genomes, and I plan to use it for haplotype sampling. How should I set the k-node subpaths? The default value is 4.
There are two different methods used with Giraffe that have been called haplotype sampling: proportional sampling, which vg autoindex uses if the graph contains more than 192 haplotypes, and personalized references built from the sequenced reads. The k-node subpaths are used in proportional sampling, and nobody has really tried values other than the default k = 4. But because your graph is based on only 7 genomes, you should use them directly instead of downsampling them to a smaller number of synthetic haplotypes.
You can find documentation for creating personalized references in the vg wiki.
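In practice, using the haplotypes directly just means mapping against the full graph. A minimal sketch, assuming the MC output is fugu.gbz (all file names here are placeholders); Giraffe builds the missing minimizer and distance indexes automatically when given only the GBZ:

    # Map paired-end short reads against the full 7-genome graph;
    # the .min and .dist indexes are created on the fly if absent.
    vg giraffe -Z fugu.gbz -f reads_1.fq.gz -f reads_2.fq.gz -p > mapped.gam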
Thank you for your response. I've successfully built a pan-genome consisting of 7 fugu genomes using MC. Now I'm planning to find more SVs by adding short-read sequence data using vg. Should I perform haplotype sampling or allele frequency filtering? The VG Giraffe best practices document says: "With a small number of haplotypes (e.g. 10), the default graph is usually a good choice." For such a simple graph, is the default already sufficient, i.e. without applying haplotype sampling or allele frequency filtering (--clip)?
Sampling and frequency filtering both require a larger number of haplotypes to do anything meaningful.
The basic idea behind sampling and filtering is that if a variant is present both in the reference graph and the sequenced genome, its presence in the graph will make read mapping more accurate. But if a variant is present in the reference but not in the sequenced genome, it can be misleading and lower the accuracy. In general, the former effect is greater than the latter.
Proportional sampling and frequency filtering both aim to create a single universal reference that includes common variants, excludes rare variants, and may include or exclude variants that are moderately common. But if you don't have enough haplotypes, you can't tell the difference between common and rare variants reliably enough, and these approaches will likely fail to improve the accuracy. With only 7 genomes, allele frequencies are quantized to multiples of 1/7 (about 14%), which is too coarse for any meaningful frequency threshold.
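To make the mechanism concrete: proportional downsampling is what vg autoindex applies internally to large graphs, via local haplotype sampling in vg gbwt. A sketch under assumed file names (the path count of 64 is, to my understanding, the usual default; treat the exact invocation as illustrative rather than authoritative):

    # Sample local haplotypes into a smaller synthetic set
    # (roughly what autoindex does when there are >192 haplotypes).
    vg gbwt -x full.xg -l -n 64 -o sampled.gbwt full.gbwt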
In contrast, a personalized reference tries to include variants that are present in the sequenced genome and exclude variants that are not. But because this is done at the level of 10 kbp blocks rather than individual variants, you need enough haplotypes in the graph to have a reasonable chance of finding a good match for most blocks.
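For reference, the personalized-reference (haplotype sampling) workflow documented in the vg wiki looks roughly like this; file names are placeholders, and the KMC parameters follow the wiki's example:

    # One-time preprocessing of the graph into a .hapl file
    vg haplotypes -v 1 -t 16 -H fugu.hapl fugu.gbz
    # Count read k-mers with KMC, writing KFF output
    # (reads_list.txt lists the FASTQ files, one per line)
    kmc -k29 -m128 -okff -t16 @reads_list.txt sample tmp_dir
    # Sample a personalized graph guided by the read k-mers, then map
    vg haplotypes -v 1 -t 16 -i fugu.hapl -k sample.kff -g sample.gbz fugu.gbz
    vg giraffe -Z sample.gbz -f reads_1.fq.gz -f reads_2.fq.gz > mapped.gam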
If you have a small number of genomes in the graph, none of these techniques will likely improve the accuracy.
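For your SV-hunting use case, then, the straightforward route is to map against the default graph and genotype the variants embedded in it. A sketch of the downstream steps, reusing the mapped.gam from the Giraffe sketch above (the MAPQ cutoff is illustrative):

    # Compute read support over the graph, ignoring low-quality mappings
    vg pack -x fugu.gbz -g mapped.gam -Q 5 -o aln.pack
    # Genotype the variants embedded in the graph (including SVs)
    vg call fugu.gbz -k aln.pack > variants.vcf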
Thank you so much for your response! It gave me a much better understanding. You're awesome!
Please accept the answer (green check mark) to provide closure to this thread.