I'm building a very basic Markov model-based prokaryotic gene finder for a class project, and I have been reading some of the literature on GLIMMER for guidance. If I have understood the basic algorithm correctly, GLIMMER scores a given ORF in all six reading frames, normalizes the six scores so that they represent the probability that the ORF is a gene, and then predicts a gene if the ORF scores above a certain threshold in the correct reading frame (with some filtering for overlaps after this). I have two questions that I hope someone more familiar with these types of algorithms can help me with.
First, they mention earlier in the paper that intuitively one would want to have a seventh model for non-coding regions, but that this is "not strictly necessary". I'm not sure I understand why this isn't necessary. I imagine a situation where an ORF scores very poorly in all six reading frames, but the normalization makes the correct reading frame stand out, so it appears to be a gene. Wouldn't you need a non-coding model as a reference point?
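To make the worry concrete, here is a minimal sketch of the situation I have in mind. It assumes each frame model returns a log-likelihood and that normalization simply rescales the likelihoods to sum to 1; the numbers, the `normalize` helper, and the seventh "non-coding" score are all made up for illustration and are not taken from GLIMMER.

```python
import math

def normalize(log_scores):
    """Rescale per-model log-likelihoods into probabilities that sum to 1."""
    m = max(log_scores)  # subtract the max so exp() does not underflow
    weights = [math.exp(s - m) for s in log_scores]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical log-likelihoods of one ORF under the six frame models.
good_orf = [-100.0, -120.0, -125.0, -130.0, -128.0, -126.0]
# Same pattern, but every frame model fits 200 log-units worse.
poor_orf = [s - 200.0 for s in good_orf]

# With six models only, the two ORFs are indistinguishable after normalizing.
p_good = normalize(good_orf)
p_poor = normalize(poor_orf)

# Adding a hypothetical seventh (non-coding) score as a reference point
# separates them: frame 1 still wins for good_orf but not for poor_orf.
noncoding = -150.0
p_good7 = normalize(good_orf + [noncoding])[0]
p_poor7 = normalize(poor_orf + [noncoding])[0]

print(p_good[0], p_poor[0])
print(p_good7, p_poor7)
```

In this toy case the six-way normalization only sees the differences between frames, so a uniformly bad fit is invisible unless some reference score is included.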
Second, and probably related, how does one actually do this normalization? Is it as simple as just scaling the six scores so that they add up to 1.0? Or is there a more general way of normalizing the score from a Markov Model that accounts for the length of the sequence?
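For reference, this is roughly what I mean by the two options, as a sketch under my own assumptions (log-likelihood scores, hypothetical helper names `log_normalize` and `per_base`, invented numbers). I do not know whether GLIMMER does either of these. One thing I noticed while trying it: for an ORF of realistic length the raw probabilities underflow to zero, so the "scale to sum to 1.0" step seems to need to happen in log space.

```python
import math

def log_normalize(log_scores):
    """Normalize log-likelihoods to probabilities using log-sum-exp."""
    m = max(log_scores)
    log_total = m + math.log(sum(math.exp(s - m) for s in log_scores))
    return [math.exp(s - log_total) for s in log_scores]

def per_base(log_scores, length):
    """Average log-likelihood per base, one simple length correction."""
    return [s / length for s in log_scores]

# Hypothetical log-likelihoods for a 1500 bp ORF in six reading frames.
orf_length = 1500
log_scores = [-2000.0, -2010.0, -2015.0, -2020.0, -2012.0, -2018.0]

# Converting to raw probabilities first underflows: every value becomes 0.0,
# so dividing by their sum would fail.
raw = [math.exp(s) for s in log_scores]

# Normalizing in log space avoids the underflow.
posteriors = log_normalize(log_scores)

# Per-base scores can be compared across ORFs of different lengths.
avg = per_base(log_scores, orf_length)

print(raw)
print(posteriors)
print(avg)
```

Since the length is identical across the six frames of a single ORF, the per-base scaling would matter mainly when comparing scores between different ORFs.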
Please point out any egregious misunderstandings, as I am only just beginning to study these methods.
The paper didn't describe the normalization method. I'm trying to sort through the source code for it, but I haven't had much luck yet.
What is the name of the paper?
Microbial gene identification using interpolated Markov models