I am trying to do some k-mer analysis (counting, computing scores, and running tests) on some genomes. I have been reading about it, but some of the material is obscure and tedious, with far-fetched explanations or confusing formulas and derivations. I am looking for a reference that is light, concise, and clear about what is going on. I am not a mathematician or statistician, so I need something clear and direct.
Do you have any pointers?
I am currently using the following:
P(W) = P(W1 | W2...Wn-1) * P(Wn | W2...Wn-1) * P(W2...Wn-1)
The probability that an arbitrary n-mer is the word W is built from 3 components: the probability that the core (the middle n-2 bases) matches, and the probabilities of the first and last bases given that the core matches.
E(C(W)) = C(W1...Wn-1) * C(W2...Wn) / C(W2...Wn-1)
E(C(W)) is the expected number of times W occurs in the genome, and C(Wi...Wj) is the observed number of times the word Wi...Wj occurs.
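To make the formula concrete, here is a minimal Python sketch of how I understand it. The function names count_kmers and expected_count are just mine for illustration, and it assumes a single strand, overlapping k-mers, and a word of at least 3 bases (so the core is not empty).

```python
from collections import Counter

def count_kmers(seq, k):
    """Count every overlapping k-mer in seq (single strand, no filtering)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def expected_count(word, seq):
    """E(C(W)) = C(W1...Wn-1) * C(W2...Wn) / C(W2...Wn-1)."""
    n = len(word)
    if n < 3:
        raise ValueError("word must have at least 3 bases")
    sub_counts = count_kmers(seq, n - 1)   # counts of all (n-1)-mers
    core_counts = count_kmers(seq, n - 2)  # counts of all (n-2)-mers
    prefix = sub_counts[word[:-1]]         # C(W1...Wn-1)
    suffix = sub_counts[word[1:]]          # C(W2...Wn)
    core = core_counts[word[1:-1]]         # C(W2...Wn-1)
    # If the core never occurs, the word cannot occur either.
    return prefix * suffix / core if core else 0.0

genome = "ACGTACGTAC"  # toy sequence, not real data
observed = count_kmers(genome, 3)["ACG"]
expected = expected_count("ACG", genome)
```

For a real genome it would be better to compute the (n-1)-mer and (n-2)-mer counts once and reuse them for every word, instead of recounting inside the function.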
Variance (binomial approximation, taking N as the total number of n-mer positions in the genome)
Var(C(W)) = N* P(W) * (1-P(W)) = E(C(W)) * (1 - E(C(W))/N)
The standard deviation
sigma(W) = sqrt(E(C(W)) * (1 - E(C(W))/N))
And the z-score
Z(W) = (C(W) - E(C(W))) / sigma(W)
to detect under- and over-represented k-mers.
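Putting the variance, standard deviation, and z-score together, a rough sketch could look like this. The function name z_score and the argument n_positions are my own labels; n_positions is assumed to be N, the number of n-mer positions in the genome (genome length - n + 1), and the numbers in the example are made up.

```python
import math

def z_score(observed, expected, n_positions):
    """Z(W) = (C(W) - E(C(W))) / sqrt(E(C(W)) * (1 - E(C(W))/N))."""
    variance = expected * (1 - expected / n_positions)
    if variance <= 0:
        # Happens when expected is 0 or >= N; the score is undefined there.
        return float("nan")
    return (observed - expected) / math.sqrt(variance)

# Made-up numbers: a word seen 30 times where 20 were expected,
# in a sequence with 1000 n-mer positions.
z = z_score(observed=30, expected=20.0, n_positions=1000)
```

A positive z means the word is seen more often than expected and a negative z means less often. How large |z| has to be before calling a word over- or under-represented depends on the cutoff chosen and on how many words are tested.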
I would like to learn about, and be pointed to, other scores and, more importantly, tests for this analysis.
I really appreciate any help.
Thank you for your time.
Paulo
I was thinking of creating some random genomes with the same background base frequencies as the original genomes, using something like random.choices. Basically, that is what I have seen in some papers. Thank you for your time, and I will check out your refs.
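Something like this is what I have in mind, as a minimal sketch. The function name random_genome and the seed argument are just for illustration; it draws each base independently with weights taken from the original sequence, so it keeps only the single-base composition and not any dinucleotide or higher-order structure.

```python
import random
from collections import Counter

def random_genome(seq, seed=None):
    """Return a random sequence with the same length as seq, drawing each
    base independently according to the base frequencies observed in seq."""
    rng = random.Random(seed)            # private generator, reproducible with a seed
    freqs = Counter(seq)                 # observed base counts, e.g. {'A': 3, ...}
    bases = sorted(freqs)
    weights = [freqs[b] for b in bases]  # counts work as relative weights
    return "".join(rng.choices(bases, weights=weights, k=len(seq)))

original = "ACGTACGTACGGGTTTAACC"  # toy sequence, not real data
simulated = [random_genome(original, seed=i) for i in range(100)]
```

Counting a word in each simulated genome would give an empirical distribution of its count under this simple background model, which the observed count can then be compared against.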