Are there any good tutorials or books to help with statistics for k-mer analysis?
4.2 years ago
psschlogl ▴ 50

I am trying to do some k-mer analysis (counting, computing scores, and running tests) on some genomes. I have been reading about it, but much of the material is obscure and tedious, with far-fetched explanations or confusing formulas and derivations. I would like a reference that is light, concise, and clear about what is going on. I am not a mathematician or statistician, so I need something clear and direct.

Do you have any pointers?

I am currently using the following:

P(W) = P(W1 | W2...Wn-1) * P(Wn | W2...Wn-1) * P(W2...Wn-1)

The probability that an arbitrary n-mer is the word W is built from three components: the probability that the core (the middle n-2 bases) matches, and the probabilities of the first and last bases given that the core matches.

E(C(W)) = C(W1...Wn-1) * C(W2...Wn) / C(W2...Wn-1)

E(C(W)) is the expected value for the count of the number of times W occurs in the genome, and C(Wi...Wj) is the actual count of the number of times the word Wi...Wj occurs.
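To make the expected-count formula concrete, here is a minimal Python sketch. The helper names `count_kmers` and `expected_count` and the toy sequence are made up for illustration, and it assumes overlapping counts on a single strand and words of length 3 or more.

```python
from collections import Counter

def count_kmers(seq, k):
    """Count overlapping k-mers in seq (single strand only)."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

def expected_count(word, seq):
    """E(C(W)) = C(W1...Wn-1) * C(W2...Wn) / C(W2...Wn-1), for len(word) >= 3."""
    n = len(word)
    sub = count_kmers(seq, n - 1)    # counts of all (n-1)-mers
    core = count_kmers(seq, n - 2)   # counts of all (n-2)-mers
    denom = core[word[1:-1]]         # count of the core W2...Wn-1
    if denom == 0:
        return 0.0                   # core never occurs, so W cannot occur
    return sub[word[:-1]] * sub[word[1:]] / denom

toy = "ACGTACGTACGGACGT"             # toy sequence, not real data
observed = count_kmers(toy, 4)["ACGT"]
expected = expected_count("ACGT", toy)
print(observed, expected)
```

On a real genome you would probably count each k once and reuse the `Counter` objects instead of recounting for every word.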

Variance

Var(C(W)) = N* P(W) * (1-P(W)) = E(C(W)) * (1 - E(C(W))/N)

The standard deviation

sigma(W) = sqrt(E(C(W)) * (1 - E(C(W))/N))

And the z-score

Z(W) = (C(W) - E(C(W))) / sigma(W)

to detect under- or over-abundant k-mers.

I would like to learn about, and be pointed to, some other scores and, more importantly, statistical tests for this analysis.
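As a rough sketch, the variance, standard deviation, and z-score above fit in one small Python function. The name `z_score` is made up for illustration, and taking N as the number of n-mer positions in the genome (length - n + 1) is an assumption, since N is not defined above.

```python
from math import sqrt

def z_score(observed, expected, n_positions):
    """Z(W) = (C(W) - E(C(W))) / sqrt(E(C(W)) * (1 - E(C(W))/N)).

    n_positions plays the role of N; using genome_length - n + 1
    for it is an assumption of this sketch.
    """
    variance = expected * (1 - expected / n_positions)
    if variance <= 0:
        return float("nan")          # z-score undefined without variance
    return (observed - expected) / sqrt(variance)

# Made-up numbers, only to show the call:
z = z_score(observed=30, expected=20.0, n_positions=1000)
print(round(z, 3))
```

A large positive value suggests an over-abundant word and a large negative value an under-abundant one. With many words tested at once, some correction for multiple testing is usually needed before reading too much into a single z-score.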

I really appreciate any help.

Thank you for your time.

Paulo

sequence genome • 893 views
4.2 years ago
khorms ▴ 230

When you are looking for overrepresented k-mers in biological sequences, you are usually comparing one group of sequences to another. That means the background distribution has to come from the control group of sequences. I am not sure how you are planning to incorporate such a background distribution into your framework here. A lot of work has been published on the subject. I would recommend looking into information-theory-based methods such as FIRE (paper, website) because of their flexibility.

I was thinking of creating some random genomes with the same background base frequencies as the original genomes, using something like random.choices. Basically, that is what I have seen in some papers. Thank you for your time, and I will check out your references.
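A minimal sketch of that idea, in case it helps to see it written out. The function names `random_background` and `empirical_p_value`, the number of simulations, and the toy sequence are all made up for illustration. Note that this background only preserves single-base frequencies, not dinucleotide or higher-order composition.

```python
import random
from collections import Counter

def random_background(seq, rng):
    """Draw a random sequence of the same length, with bases sampled
    independently according to the base frequencies observed in seq."""
    freqs = Counter(seq)
    bases = sorted(freqs)
    weights = [freqs[b] for b in bases]
    return "".join(rng.choices(bases, weights=weights, k=len(seq)))

def count_word(seq, word):
    """Count overlapping occurrences of word in seq."""
    k = len(word)
    return sum(1 for i in range(len(seq) - k + 1) if seq[i:i + k] == word)

def empirical_p_value(seq, word, n_sim=200, seed=0):
    """Fraction of simulated sequences in which word occurs at least
    as often as in seq (with +1 smoothing so the value is never 0)."""
    rng = random.Random(seed)
    observed = count_word(seq, word)
    hits = sum(
        count_word(random_background(seq, rng), word) >= observed
        for _ in range(n_sim)
    )
    return (hits + 1) / (n_sim + 1)

toy = "ACGT" * 50                    # toy sequence, not real data
print(empirical_p_value(toy, "ACGT"))
```

For real genomes this pure-Python loop will be slow, so a proper k-mer counter or vectorised code would likely be needed, but the logic stays the same.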
