Question

Where does Kmergenie get it's y axis numbers from? Is it suppose to be the local maxima of each plot?

0

Entering edit mode

9.0 years ago

Tom ▴ 20

This may sound rudimentary, but where is KmerGenie getting it's y-axis (number of genomic k-mers) numbers from when it goes to plot the final plot in the HTML output?

I was under the assumption that it was taking the local maximum of each k histogram's curve, and taking that number and putting it into the Kmerplot. However, my numbers aren't lining up with the histogram's numbers; I have no idea what's going on. What happens when my where should I be looking to calculate the y-value for each kmer in the final graph?

kmergenie kmer • 2.4k views

ADD COMMENT • link updated 2.3 years ago by Ram 44k • written 9.0 years ago by Tom ▴ 20

0

Entering edit mode

If you mean the first graph, then no. It's not a local maxima. I believe it's the area under the graph, but there my be a cutoff, like x > 1.

ADD REPLY • link 9.0 years ago by apelin20 ▴ 480

0

Entering edit mode

Can you make an educated guess as to where its integrating from and to?

ADD REPLY • link 9.0 years ago by Tom ▴ 20

0

Entering edit mode

Send an email to the author, he usually replies pretty quickly.

ADD REPLY • link 9.0 years ago by apelin20 ▴ 480

2

Entering edit mode

For some reason I don't get biostars email notifications anymore.

Adrian (-- I'm assuming it's him) is right, it's indeed an area under the histogram curve. It is weighted by the probability that, for a given abundance, a kmer is erroneous or not.

E.g. if you have 10 kmers of abundance 1, but the model thinks that the density of erroneous kmers at abudance 1 is 0.7, kmergenie will predict (1-0.7)*10 = 3 genomic kmers for this abundance. Then kmergenie sums the predicted number of genomic kmers over all abundances.

This is for the haploid model; for the diploid model, it's slightly more advanced, to take into account heterozygous kmers (and divide their contribution by two).

ADD REPLY • link 9.0 years ago by Rayan Chikhi ★ 1.5k

0

Entering edit mode

Thank you for your input.

So considering that camel hump graph that's generated for say, k = 35. All data after the first "dip" should be ignored. Then, I should take the area under the curve from that first local minimum, all the way to the right to the tail end of the graph. That should in theory yield the number of bases equal to my genome size. Am I doing this right?

ADD REPLY • link updated 5.0 years ago by Ram 44k • written 9.0 years ago by Tom ▴ 20

0

Entering edit mode

I would say the "eyeball" estimation would be from the first minima to the end of the peak of the last maxima, but I would trust the model of kmergenie more, as it's more roboust than that.

ADD REPLY • link 9.0 years ago by apelin20 ▴ 480

0

Entering edit mode

Right, you can eyeball the genome size this way.

Actually, this part of Kmergenie is in the R language, so you could have a look yourself if you're familiar with R. It's in scripts/est-genomic-kmers.r. Line 29 holds the abundance histogram, line 31 holds probability that a kmer is correct at this abundance, and line 38 computes the number of genomic kmers (line numbers might change in future releases).

ADD REPLY • link updated 5.0 years ago by Ram 44k • written 9.0 years ago by Rayan Chikhi ★ 1.5k

0

Entering edit mode

Yes, Adrian is I. Thanks for the answer! always nice to learn more.

ADD REPLY • link 9.0 years ago by apelin20 ▴ 480