Question

R: How to correctly compute p-value of a k-mer?

0

Entering edit mode

8.4 years ago

francois ▴ 90

In R Studio: I have a chromosome sequence loaded (chr21 of hg19) and I have computed the frequencies of each nucleotide and N:

chr21 <- readDNAStringSet ('chr21.fa')[[1]]

freq <- letterFrequency (chr21, c('A', 'C', 'G', 'T', 'N'), as.prob = T)

I have written a function to compute the probability of a word (in DNAString type) given the probabilities of the nucleotides (Bernouilli model):

Bernoulli <- function (word, freq) {
  P <- 1
  for (i in 1:length (word))
    P <- P * freq[as.character(word[i])]
  as.numeric (P)
}

I have computed the occurences and frequency of every 7-nucleotides words in chr21:

sevenmer_occ = oligonucleotideFrequency (chr21, 7)
sevenmer_freq = oligonucleotideFrequency (chr21, 7, as.prob = T)

Now I am asked to compute "the p-value of each 7-nucleotides words (oligomers) given the Bernouilli model".

I cannot find how to do that... I think I should use the function pbinom for this, but I don't think I understand it well, as it is mostly giving me 0 or 1.

Can you help?

R genome motif • 2.4k views

ADD COMMENT • link updated 8.4 years ago by Steven Lakin ★ 1.8k • written 8.4 years ago by francois ▴ 90

0

Entering edit mode

The question becomes what the p-value is meant to compare. The naive computation you show might or might not be meaningful for answering your actual biological question. So describe the question you're actually trying to answer and we can help you further.

ADD REPLY • link 8.4 years ago by Devon Ryan 105k

0

Entering edit mode

Sorry, I first uploaded my question by mistake, I did not know people could read it while I was editing. Should be much clearer now!

ADD REPLY • link 8.4 years ago by francois ▴ 90

score 2 · Answer 1 · 2016-12-12

There are two ways to formulate this question: 1. What is the probability of seeing a given k-mer with certain nucleotide probabilities (this is more like a hidden markov model but in a site-directed manner) and 2. What is the frequency of seeing a given k-mer in a bag of k-mers, and is that frequency higher or lower than it should be by a null hypothesis? I believe you want to examine #2, not #1.

Think of it as a bag of words problem where you have a bag of k-mers (total # of 7-mers examined in your chromosome) and a draw of a single k-mer is considered a success if it matches the one you want. We can compute the null hypothesis (probability of seeing a 7-mer with equally likely nucleotide choices) as (1 / 4)^k, so in your case, this is:

P_null = 0.25^7 = 0.000061035

Now, going back to your bag of k-mers, you have observed each of these 7-mers a given number of times. To compute probability based on observations, a total number of words, and the probability of the null, we use the exact binomial test:

for( kmer in all_posible_kmers ) {
    kmer_results <- binom.test(x = num_observations[kmer],
                               n = total_num_kmers,
                               p = 0.000061035,
                               alternative = 'two.sided')
    kmer_pvalue <- kmer_results$p.value
}

x is the number of observations in your data for that given kmer, n is the total number of words (number of trials), and p is the null p. You'll have to construct the loop to iterate over all kmer counts of interest. You want to consider a two-sided test, since you're interested in higher or lower than the null.