Question

Which peak is homozygous and which is heterozygous in a k-mer plot for genome size estimation?

1

9.8 years ago

Prakki Rama ★ 2.7k

Hi all,

How do we know which peak is homozygous and which is heterozygous when we generate a k-mer plot for estimating genome size? I would be thankful for your directions.

genome kmer Assembly • 6.2k views

ADD COMMENT • link updated 9.8 years ago by thackl ★ 3.0k • written 9.8 years ago by Prakki Rama ★ 2.7k

score 4 · Accepted Answer · 2015-02-18

4

9.8 years ago

thackl ★ 3.0k

Assuming a diploid organism (and two peaks), the heterozygous peak is the first peak, ideally at 1/2 the coverage of the second, hopefully larger, homozygous peak. This is simply because k-mers from a homozygous site are present on both alleles, while k-mers from a heterozygous site are present on only one allele and therefore produce a signal at half the expected genome coverage.
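As a rough illustration of the rule above, here is a minimal Python sketch that looks for the two tallest local maxima in a k-mer histogram and checks whether the first sits at about half the coverage of the second. Everything in it is an assumption for illustration: the histogram is a plain dict of coverage to number of distinct k-mers, the names `find_peaks` and `label_diploid_peaks`, the `min_cov` cutoff for the error tail, the `tolerance` value, and the toy numbers are all made up. Real histograms are noisy and usually need smoothing before a naive local-maximum search like this works.

```python
def find_peaks(hist, min_cov=3):
    """Return coverages that are local maxima of the histogram.

    hist maps coverage -> number of distinct k-mers at that coverage.
    Coverages below min_cov are skipped as a crude way to ignore the
    low-coverage tail of error k-mers (assumed cutoff, tune per data set).
    """
    peaks = []
    for cov in sorted(c for c in hist if c >= min_cov):
        left = hist.get(cov - 1, 0)
        right = hist.get(cov + 1, 0)
        if hist[cov] > left and hist[cov] >= right:
            peaks.append(cov)
    return peaks


def label_diploid_peaks(hist, min_cov=3, tolerance=0.2):
    """Label the two tallest peaks as heterozygous and homozygous.

    Returns None if fewer than two peaks are found. 'consistent' is True
    when the lower-coverage peak lies near half the higher-coverage one.
    """
    peaks = find_peaks(hist, min_cov)
    tallest = sorted(peaks, key=lambda c: hist[c], reverse=True)[:2]
    if len(tallest) < 2:
        return None
    het, hom = sorted(tallest)
    ratio = het / hom
    return {
        "heterozygous": het,
        "homozygous": hom,
        "ratio": ratio,
        "consistent": abs(ratio - 0.5) <= tolerance,
    }


# Toy histogram (invented numbers): error tail, a peak near 20x, a peak near 40x.
toy_hist = {
    1: 90000, 2: 20000, 3: 4000, 4: 900, 5: 300,
    18: 400, 19: 700, 20: 900, 21: 650, 22: 380,
    38: 1500, 39: 2600, 40: 3000, 41: 2500, 42: 1400,
}

result = label_diploid_peaks(toy_hist)
print(result)
```

If the two tallest peaks are not at a ratio close to 1/2, the two-peak diploid assumption probably does not hold for that data set, and the labels should not be trusted.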

ADD COMMENT • link 9.8 years ago by thackl ★ 3.0k

0

Thank you. But what about the other small peaks that appear in the plot after the homozygous peak? They must be repetitive regions with higher coverage, am I right?

ADD REPLY • link 9.7 years ago by Prakki Rama ★ 2.7k

1

Yes, additional peaks after the C2 peak (the diploid genome peak) represent regions with higher copy number, such as repeats. However, to form a peak, you need a larger region or many sequences with very similar copy numbers.

Repeats usually don't form a peak, as each repeat is small and different repeats have different copy numbers.

But for example, I've got a plot from a small genome with high gene content that has a small, distinct peak at C4. This peak comprises duplicated gene families. Mitochondria and chloroplasts also produce their own peaks at their respective coverage (often 100-10,000 times the genome coverage). Partial genome duplications or chromosome aberrations can produce additional distinct peaks as well, and bacterial contamination, symbionts and parasites might also produce peaks.

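You can estimate the "size" of a peak to get an idea of what it represents. Simply sum up count × coverage over the peak region: for each coverage value in the region, multiply it by the number of distinct k-mers observed at that coverage, then add up the products.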
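The sum described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the histogram is a dict of coverage to number of distinct k-mers, the peak boundaries are picked by eye, the function names and toy numbers are invented, and the final division by the homozygous (C2) peak coverage is my addition, borrowed from the usual genome size estimate (total k-mers divided by peak coverage), to turn the sum into an approximate sequence length.

```python
def peak_kmer_total(hist, lo, hi):
    """Sum count * coverage over the coverage range [lo, hi].

    hist maps coverage -> number of distinct k-mers at that coverage.
    The result is the total number of k-mer occurrences in the peak.
    """
    return sum(cov * count for cov, count in hist.items() if lo <= cov <= hi)


def approx_peak_length(hist, lo, hi, genome_cov):
    """Rough sequence length behind a peak (assumption, not from the thread):
    total k-mer occurrences divided by the homozygous peak coverage."""
    return peak_kmer_total(hist, lo, hi) / genome_cov


# Toy histogram (invented numbers): homozygous peak near 40x, small peak near 80x.
toy_hist = {
    38: 1500, 39: 2600, 40: 3000, 41: 2500, 42: 1400,
    78: 40, 79: 90, 80: 120, 81: 85, 82: 35,
}

c4_total = peak_kmer_total(toy_hist, 78, 82)
c4_length = approx_peak_length(toy_hist, 78, 82, genome_cov=40)
print(c4_total, c4_length)
```

Comparing the resulting length with something known, such as the expected size of an organelle genome or of a gene family, gives a hint about what the peak might be.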

ADD REPLY • link 9.7 years ago by thackl ★ 3.0k