Suppose I have sequenced a haploid genome. There is a position where I suspect I have a point mutation. I have n reads covering this position, with each read "voting" either A, T, C or G as its nucleotide call. The winner of the vote, in this case, is some nucleotide which is different from the reference, i.e. the consensus of the reads implies a mutation. The number of reads voting for the winner is q.
The probability that each individual read will correctly call a nucleotide is r. Since the error probabilities are uniform, the probability of each incorrect base is then s = (1-r)/3. So my null hypothesis is that a plurality of reads happened to make such an error so as to produce an incorrect consensus. Given this, what is the p-value for this mutation call? Is it:
- Probability of getting q successes after n trials with probability of success s (Binomial distribution), times 3 for each possible erroneous nucleotide
- Probability of getting q successes after n trials with probability of success s (Binomial CDF), times 3
- Something else?
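For concreteness, the first two candidates could be computed as in the sketch below. It is illustrative only and does not settle which one is right. The values of n, q and r are made up, the function names are hypothetical, and "Binomial CDF" is read here as the upper tail P(X >= q), which is an assumption about what the second option means.

```python
from math import comb


def binom_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def binom_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), summed directly from the PMF."""
    return sum(binom_pmf(i, n, p) for i in range(k, n + 1))


def candidate_p_values(n, q, r):
    """Return the two candidate quantities from the question.

    s = (1 - r) / 3 is the per-read probability of one specific wrong base.
    The factor 3 accounts for the three possible erroneous nucleotides.
    Note: neither quantity is capped at 1.
    """
    s = (1 - r) / 3
    point = 3 * binom_pmf(q, n, s)        # candidate 1: exactly q erroneous votes
    tail = 3 * binom_upper_tail(q, n, s)  # candidate 2 (assumed reading): at least q
    return point, tail


# Illustrative values only (not from any real data set).
n, q, r = 20, 15, 0.99
point, tail = candidate_p_values(n, q, r)
print(f"3 * P(X = q)  = {point:.3e}")
print(f"3 * P(X >= q) = {tail:.3e}")
```

The upper-tail version is always at least as large as the point-probability version, since it sums the probabilities of q, q+1, ..., n erroneous votes.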
Also, is my null hypothesis reasonable?