Question

Whether personalized pangenome considers allele frequency

5 weeks ago

Wang Cong ▴ 20

Hi, I am building a personalized pangenome (vg haplotypes). I have WGS data from a mixture of individuals (say HG001 + HG002 + HG003). HG001 makes up 90% of the mixture, and the other two make up 5% each. In this case, can I expect the personalized pangenome to approximate the HG001 assembly? Or will it approximate all three individuals' assemblies with equal weight?

pangenome vg • 423 views

ADD COMMENT • link 4 weeks ago by Wang Cong ▴ 20

score 0 · Answer 1 · 2025-02-07

4 weeks ago

Jouni Sirén ▴ 540

The sampling algorithm only sees the k-mer counts in the reads. If 90% of the reads are from the same sample, the result will be close to a personalized pangenome for that sample. The biggest impact is probably from k-mers that are absent from the primary sample but homozygous in the other two samples. Their frequency will often be high enough that they will be classified as heterozygous.
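To make the mixture effect concrete, here is a minimal sketch under an idealized assumption that is not taken from vg: each sample contributes k-mer copies in proportion to its share of the reads, so a k-mer's expected frequency, as a fraction of the overall k-mer coverage, is the read-share-weighted average of its copy number divided by two. The function name `expected_fraction` and the genotype tuples are hypothetical and only for illustration.

```python
def expected_fraction(read_shares, copies):
    """Expected k-mer frequency as a fraction of total k-mer coverage.

    read_shares: fraction of reads from each sample (should sum to 1).
    copies: copies of the k-mer in each diploid sample (0, 1, or 2).
    Idealized model: no sequencing errors, uniform coverage.
    """
    if len(read_shares) != len(copies):
        raise ValueError("read_shares and copies must have the same length")
    return sum(share * n / 2 for share, n in zip(read_shares, copies))


# Mixture from the question: HG001 90%, HG002 5%, HG003 5%.
shares = (0.90, 0.05, 0.05)

# Absent from HG001, homozygous in the other two samples.
minor_only = expected_fraction(shares, (0, 2, 2))

# Heterozygous in HG001, absent from the other two samples.
major_het = expected_fraction(shares, (1, 0, 0))

# Homozygous in HG001, absent from the other two samples.
major_hom = expected_fraction(shares, (2, 0, 0))

print(minor_only, major_het, major_hom)
```

Under this model, k-mers found only in the two minor samples reach about 10% of the coverage, while k-mers heterozygous or homozygous in HG001 alone sit near 45% and 90%. Real counts will scatter around these values because of sampling noise and sequencing errors.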

ADD COMMENT • link 4 weeks ago by Jouni Sirén ▴ 540

Thanks! I am looking at the documentation. How is absent/present/heterozygous determined in this process? Is it through the frequency in the whole k-mer library?

ADD REPLY • link 4 weeks ago by Wang Cong ▴ 20

vg first estimates k-mer coverage from the k-mer counts. If you have 30x coverage with 150 bp reads, k-mer coverage should be 21 or 22 with the default minimizer parameters. If the total frequency of a k-mer and its reverse complement is around that value, the k-mer is classified as homozygous. K-mers with frequency close to 50% of the coverage are considered heterozygous, with the threshold between heterozygous and homozygous being somewhere around 70%. K-mers with frequency below 10% of the coverage are considered absent, and those above 250% are ignored.
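The classification described above can be sketched roughly as follows. This is an illustration, not vg's actual code: the function name `classify_kmer`, the parameter names, and the handling of values exactly on a boundary are assumptions, and the thresholds are the approximate percentages quoted in this reply.

```python
def classify_kmer(count, coverage,
                  absent_below=0.10, het_below=0.70, ignore_above=2.50):
    """Classify a k-mer by its count relative to estimated k-mer coverage.

    count: combined count of the k-mer and its reverse complement.
    coverage: estimated k-mer coverage (e.g. about 22 for 30x 150 bp reads).
    Thresholds are fractions of the coverage; boundary handling is assumed.
    """
    if coverage <= 0:
        raise ValueError("coverage must be positive")
    ratio = count / coverage
    if ratio < absent_below:
        return "absent"
    if ratio > ignore_above:
        return "ignored"
    if ratio < het_below:
        return "heterozygous"
    return "homozygous"


coverage = 22
for count in (1, 3, 11, 22, 60):
    print(count, classify_kmer(count, coverage))
```

With a coverage of 22, a count of 11 (50% of coverage) comes out heterozygous and a count of 22 homozygous, while a count of 60 (above 250%) is ignored. A count of 3 is already about 14% of coverage, which is why low-frequency k-mers from minor samples in a mixture can cross the absent threshold.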

ADD REPLY • link 4 weeks ago by Jouni Sirén ▴ 540