Is there a way to calculate the maximum length at which the frequencies of k-mers can be accurately estimated?
5.5 years ago
Joseph Hughes ★ 3.0k

A reviewer has asked us to calculate the theoretical range of optimal k-mer sizes for a given database of viral sequences, stating: "The optimal range lies between: the minimum size for which a maximum number of different features can be found in the string (the viral genome); and the maximum length at which the frequencies of k-mers can be accurately estimated"

I have been trawling the web trying to find an answer to this question and have read the paper by Sims et al. (2009). However, it is not entirely clear to me how, starting from the eclectic set of unaligned sequences in my database, I can calculate these theoretical minimum and maximum k-mer sizes.

k-mer frequency • 1.8k views

Presumably the limitation on estimating 'accurately' isn't really about accuracy, and is more about computability? In principle you should be able to count the kmers of any size exactly, assuming the scaling hasn't made it impractical to compute.

Could you perhaps just plot a distribution for varying kmer sizes and extrapolate to the point where it begins to plateau, i.e. where no new information (no new kmers) is added? (If I understand the question.)
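To make the plotting idea concrete, here is a minimal sketch of how the number of distinct kmers could be tabulated for a range of sizes. The function names (`kmer_counts`, `distinct_kmers_by_k`) and the toy sequences are my own placeholders, not anything from the Sims et al. paper, and a real database would probably need a dedicated k-mer counter rather than an in-memory `Counter`.

```python
from collections import Counter

def kmer_counts(sequences, k):
    """Count every overlapping k-mer across a collection of sequences."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        # Sequences shorter than k give an empty range and contribute nothing.
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts

def distinct_kmers_by_k(sequences, k_values):
    """Return {k: number of distinct k-mers observed} for each k."""
    return {k: len(kmer_counts(sequences, k)) for k in k_values}

if __name__ == "__main__":
    # Toy sequences, purely illustrative (not real viral genomes).
    toy = ["ACGTACGTGACG", "TTGACGTACCAG"]
    for k, n in distinct_kmers_by_k(toy, range(1, 9)).items():
        # 4**k is the number of possible k-mers over A, C, G, T.
        print(k, n, 4 ** k)
```

Plotting the distinct count against k (alongside the 4**k ceiling) should show where the curve stops growing, which is roughly the plateau I had in mind.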


What do you mean by information? Entropy?


No, sorry for the confusion. I just mean until the distribution essentially begins to fall away. In the most extreme case, the longest kmer you can get is the whole genome, with an occurrence of 1, so the longer your kmers become, the less frequent they must be. This should give you a negative gradient on a graph of kmer length vs occurrence, and extrapolating from that to find a maximum 'meaningful' length might be enough to placate your reviewer?
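As a rough sketch of the length-vs-occurrence idea, something like the following could summarise, for each k, the mean occurrence per distinct kmer and the fraction of kmers seen only once. The function name `occurrence_summary` and the choice of these two summaries are assumptions on my part; which cut-off counts as 'meaningful' is not something this code decides.

```python
from collections import Counter

def occurrence_summary(sequences, k):
    """Summarise how often k-mers of length k recur across the sequences.

    Returns None when no sequence is long enough to contain a k-mer.
    """
    counts = Counter(
        seq[i:i + k].upper()
        for seq in sequences
        for i in range(len(seq) - k + 1)
    )
    if not counts:
        return None
    total = sum(counts.values())      # all k-mer occurrences
    distinct = len(counts)            # number of different k-mers
    singletons = sum(1 for c in counts.values() if c == 1)
    return {
        "mean_occurrence": total / distinct,
        "singleton_fraction": singletons / distinct,
    }

if __name__ == "__main__":
    # Toy sequences, purely illustrative (not real viral genomes).
    toy = ["ACGTACGTGACG", "TTGACGTACCAG"]
    for k in range(1, 13):
        print(k, occurrence_summary(toy, k))
```

As k grows, the mean occurrence should fall towards 1 and the singleton fraction should rise towards 1; the k at which nearly every kmer is a singleton is one candidate for the upper bound.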

(Mostly thinking aloud, rather than providing a qualified answer in any real way!)

