Hidden Markov Models Within Sequence Analysis -- Dispelling Misconceptions + Fixing Explanations
13.1 years ago
Delinquentme ▴ 200

So I've read a few threads on here, and it seems there are quite a few questions about what SPECIFICALLY makes up a hidden Markov model:

... I'll do my very best to present this information clearly ...

(Please correct any of this explicitly if it's incorrect, or just edit it, as I've made this a community wiki.)

primary assumption:

secondary assumptions:

Questions:

1) Memoryless means it is not influenced by the previous sequences (findings / states / residues / determinations / whatever the previous "items" were). How can one say that the evolutionary precursors to a genomic sequence have "no impact"?

NOTE: It makes sense that the entire gamut of evolution which led up to that genome would influence it.

2) Specifically, what gives this process its designation of "hidden" (in "Hidden Markov Model")? As mentioned in this post: http://biostar.stackexchange.com/questions/1221/when-can-a-markov-model-be-described-as-hidden

It is stated ( from wiki link ) that "the state is not directly visible, but output, dependent on the state, is visible."

NOTE: The sequence is there and we're reading it with some process... there is nothing "hidden" about this.

Does the "hidden" refer to not knowing the evolutionary process which placed that nucleotide there?

OR

Does the "hidden" designation refer to some process-dependent application, say sequencing done by fluorescing molecules (which would be indicative of a specific nucleotide) but not a "direct read"?

"Direct read" meaning "yes, unequivocally, this is a cytosine molecule here": not by perception of a causal relation, but by "directly reading".

Please specifically label and answer questions 1 and 2 =]

sequencing • 5.7k views

HMM can be used for so many things. What specific application are you talking about? HMMER?


@lh3 I was speaking for the application of DNA sequencing ... specifically de novo


What is the exact "application of DNA sequencing"? What do you mean by "de novo"? Are you thinking of finding de novo SNPs from a family trio using sequencing data (like Conrad et al. and Roach et al.)? In the small area of sequence analysis alone, HMMs can model so many things, with hidden states interpreted very differently. You should give the exact biological problem you are thinking about.


@lh3 specifically, HMM application to sequencing an individual human's genome, with multiple reads, where the HMM is trained on those multiple alignments to deduce the consensus genome


What is the exact "application of DNA sequencing"? What do you mean by "de novo"? Are you thinking of finding de novo SNPs from a family trio using sequencing data (like Conrad et al. and Roach et al.)? Anyway, your question is too vague. You should really give the specific problem you want to solve. HMMs can be used to model so many things, and the interpretation of the hidden states varies greatly. Without a description of your problem, I am not even sure whether the problem you have in mind can be solved by an HMM in the first place. In all, please be specific.


What is the exact "application of DNA sequencing"? What do you mean by "de novo"? Are you thinking of finding de novo SNPs from a family trio using sequencing data (like Conrad et al. and Roach et al.)? Anyway, your question is too vague. You should really give the specific biological problem you want to solve. HMMs can be used to model so many things, and the interpretation of the hidden states varies greatly. There are things common to all HMMs, but until you thoroughly understand HMMs in one very specific application, there is no way you can understand them in a more general context.


What is the exact "application of DNA sequencing"? What do you mean by "de novo"? Are you thinking of finding de novo SNPs from a family trio using sequencing data (like Conrad et al. and Roach et al.)? Your trouble is that you mix abstract concepts with detailed applications, which is really confusing to me. You should really give the exact biological problem you are thinking about.


What is the exact "application of DNA sequencing"? What do you mean by "de novo"? Are you thinking of finding de novo SNPs from a family trio using sequencing data (like Conrad et al. and Roach et al.)? In the small area of sequence analysis alone, HMMs can model so many things, with hidden states interpreted very differently. You should really give the exact biological problem you are thinking about. Your question is entirely confusing to me. Sorry.


@delinquentme I'd second the suggestion to be more specific about the application of HMMs you are interested in understanding. Typically you are not using an HMM to predict nucleotides. As you point out, we're observing the nucleotides from our sequencing experiments. The goal of an HMM is to make predictions about some biological property from the observed sequence. For example, we observe a sequence and we want to know where the genes are. The observed data are the nucleotide labels, and the hidden property is "in gene" or "not in gene".


@delinquentme, are you thinking of using an HMM to align sequences? In that case the nucleotide labels in the sequences are the observed data, and the hidden states are whether that position represents an insertion, a deletion, or a substitution.


@charles ... but wouldn't that be deduced from getting multiple coverage .. and simply figuring out which is the most statistically probable sequence?


If you are thinking of inferring a consensus without gaps, I do not see the point of using an HMM. Most simplistic methods will work sufficiently well. HMMs become really powerful when you start to deal with gaps, but my impression is that you are not yet prepared for such complexity. Read my BAQ paper [PMID:21320865]. Not the same, but very relevant.


Do you thoroughly understand the few simple examples in Richard Durbin's "Biological sequence analysis"? If not, understand those examples first and then revisit your own questions.


@delinquentme, as lh3 says, you don't need an HMM if you are just piling up reads and throwing out ones with mismatches. On the other hand, HMMs are useful if you want to start worrying about whether a mismatch between an assembled sequence and a reference genome is a SNP or a sequencing error. Is there some particular program or paper you are trying to understand? HMMs are used to solve all sorts of problems, from speech recognition to sequence alignment, gene finding, and protein structure determination; the details and the vocabulary vary from problem to problem.


@delinquentme, with respect to question one: if you are talking about sequence alignment, the HMM isn't concerned with the different states of the sequence over evolutionary history, except for the assumption that the sequences have a common ancestor. Rather, we're looking at how states change as we move from left to right along the sequence. Suppose we are doing sequence alignment with gaps. We can't observe gaps; we can only infer them. Roughly speaking, the probability that a position is a gap depends primarily on whether or not the position immediately to its left is a gap.


@Charles is there a succinct text on sequencing, gaps, additions and deletions? I'm a programmer who's pretty new to biology... I'd love to wrap my head around this more

13.1 years ago
Fabian Bull ★ 1.3k

Where to start:

Are you trying to understand HMMs, or are you trying to understand a specific application? If the latter, post a source and I'll try to explain it.

Basics about Hidden Markov Models:

An HMM is a so-called generative model (meaning: it generates your data). You can imagine an HMM as a model with a given number of states. You start in a specific state with a given probability and transition from one state to another with a given probability. While you are in a state you emit symbols with a given probability. This process generates your sequential data.

Example: This is a classic example, so no credit to me. It is called "the occasionally dishonest casino". You are in a casino playing a game. A die is rolled, and every time it shows a six you win; otherwise you lose. Naturally the casino wants to maximize its profit, so it sometimes uses a loaded die with a lower probability of showing a six. The loaded and the fair die represent two states (each with its own probabilities of emitting the numbers 1-6). The casino switches dice with a given transition probability.

What you observe is the outcome of each roll. What you do not observe is which die was used (the state). That's why the sequence of states during the process is called hidden.
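The casino described above can be written down as a small generative program. This is only a sketch: every probability below is a hypothetical number chosen for illustration (the post gives none), and the names `START`, `TRANSITION`, `EMISSION`, `draw` and `generate` are made up for this example.

```python
import random

# Hypothetical parameters, chosen only for illustration.
STATES = ("fair", "loaded")
FACES = (1, 2, 3, 4, 5, 6)
START = {"fair": 0.5, "loaded": 0.5}
TRANSITION = {
    "fair": {"fair": 0.95, "loaded": 0.05},
    "loaded": {"fair": 0.10, "loaded": 0.90},
}
EMISSION = {
    "fair": {face: 1 / 6 for face in FACES},
    # The loaded die shows a six less often, as in the post.
    "loaded": {1: 0.19, 2: 0.19, 3: 0.19, 4: 0.19, 5: 0.19, 6: 0.05},
}


def draw(dist, rng):
    """Sample one outcome from a {outcome: probability} dictionary."""
    outcomes = list(dist)
    weights = [dist[o] for o in outcomes]
    return rng.choices(outcomes, weights=weights, k=1)[0]


def generate(n, seed=0):
    """Generate n rolls together with the die that produced each roll."""
    rng = random.Random(seed)
    states, rolls = [], []
    state = draw(START, rng)
    for _ in range(n):
        states.append(state)                      # hidden from the player
        rolls.append(draw(EMISSION[state], rng))  # what the player sees
        state = draw(TRANSITION[state], rng)      # casino may switch dice
    return states, rolls


states, rolls = generate(20)
print("rolls :", rolls)
print("states:", states)
```

A player at the table only ever sees `rolls`. The `states` list exists inside the simulation but is never shown, which is exactly the sense in which it is hidden.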

Back to your post:

Your secondary assumptions need to be more specific:

Markov property: The Markov property states that a process is "memoryless". This is often described incorrectly. People usually say something like: "the state you are in only depends on the state you were in one step before". THAT IS WRONG. The correct statement would be: "given the state you were in immediately before entering your current state, the state you are in does not depend on any of the earlier states". (Sorry for this sentence, but people smarter than me have tried to formulate this difficult statement and failed.) It is often stated that Markov models of lower order cannot model higher-order dependencies. That is also wrong.
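In symbols, the distinction can be sketched as follows. The notation is an assumption introduced here and is not used elsewhere in the thread: $q_t$ is the state at step $t$ and $x_t$ is the symbol emitted at step $t$.

```latex
% First-order Markov property: conditional independence given the previous state
P(q_t \mid q_{t-1}, q_{t-2}, \dots, q_1) = P(q_t \mid q_{t-1})

% This does not make q_t independent of earlier states; in general
P(q_t \mid q_{t-2}) \neq P(q_t)

% Joint probability of a state sequence q and an observed sequence x in an HMM
P(x, q) = P(q_1)\, P(x_1 \mid q_1) \prod_{t=2}^{T} P(q_t \mid q_{t-1})\, P(x_t \mid q_t)
```

The second line is why the popular phrasing misleads: earlier states still carry information about the current one, but only through the state immediately before it.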

Stochastic: A process is stochastic if its behavior is governed by some kind of randomness. In an HMM this means that your data can be generated by multiple state sequences, each with its own probability. Often your data can be generated by every possible state sequence, so the probabilities of the state sequences, conditioned on the data, sum to one.

Algorithms often used with HMMs:

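  1. The Viterbi algorithm gives you the most probable state sequence that could have emitted your data.
  2. The forward-backward algorithm computes the likelihood of your data given an HMM (the forward pass alone suffices for the likelihood; adding the backward pass also gives the posterior probability of each state at each position).
  3. The Baum-Welch algorithm gives you the model parameters that (locally) maximize the likelihood.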
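The first two algorithms in the list can be sketched for the casino example. This is a minimal illustration and not a reference implementation: the probabilities are hypothetical, the function names are made up for this example, the forward pass is not rescaled (so it is only suitable for short sequences), and Baum-Welch is left out to keep the block short.

```python
import math
from itertools import product

# Hypothetical casino parameters, for illustration only.
STATES = ("fair", "loaded")
START = {"fair": 0.5, "loaded": 0.5}
TRANS = {
    "fair": {"fair": 0.95, "loaded": 0.05},
    "loaded": {"fair": 0.10, "loaded": 0.90},
}
EMIT = {
    "fair": {f: 1 / 6 for f in range(1, 7)},
    "loaded": {1: 0.19, 2: 0.19, 3: 0.19, 4: 0.19, 5: 0.19, 6: 0.05},
}


def joint(path, obs):
    """Joint probability of one particular state path and the observations."""
    p = START[path[0]] * EMIT[path[0]][obs[0]]
    for i in range(1, len(obs)):
        p *= TRANS[path[i - 1]][path[i]] * EMIT[path[i]][obs[i]]
    return p


def forward(obs):
    """Likelihood of the observations, summed over all state paths."""
    alpha = {s: START[s] * EMIT[s][obs[0]] for s in STATES}
    for x in obs[1:]:
        alpha = {
            s: EMIT[s][x] * sum(alpha[r] * TRANS[r][s] for r in STATES)
            for s in STATES
        }
    return sum(alpha.values())


def viterbi(obs):
    """Most probable state path, computed in log space."""
    score = {s: math.log(START[s]) + math.log(EMIT[s][obs[0]]) for s in STATES}
    back = []
    for x in obs[1:]:
        prev, score, pointers = score, {}, {}
        for s in STATES:
            best = max(STATES, key=lambda r: prev[r] + math.log(TRANS[r][s]))
            pointers[s] = best
            score[s] = prev[best] + math.log(TRANS[best][s]) + math.log(EMIT[s][x])
        back.append(pointers)
    path = [max(STATES, key=lambda s: score[s])]
    for pointers in reversed(back):  # follow back-pointers from the end
        path.append(pointers[path[-1]])
    return path[::-1]


obs = [6, 6, 1, 3, 2, 5, 4, 1, 2, 3]
brute_force = sum(joint(p, obs) for p in product(STATES, repeat=len(obs)))
print("forward likelihood    :", forward(obs))
print("sum over all paths    :", brute_force)
print("most probable die path:", viterbi(obs))
```

The brute-force sum enumerates every possible state path, which is only feasible for tiny inputs. The forward recursion reaches the same number in time linear in the sequence length, which is the reason these dynamic-programming algorithms are used in practice.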

For a good introduction to HMMs see BSA ("Biological Sequence Analysis" by Durbin et al.).


@peri4n "The correct statement would be: 'the state you are in does not depend on all its precursors given the very last state you were before you entered your current state'."

^ this makes somewhat more sense. However specifically in bioinformatics we are capable of taking into consideration evolution within DNA ( thus offering us a HUGE set of data to compare against? )


@peri4n could you also define "state-sequence" ... does this mean the DNA being in a particular "state" ... IE being one read from a multiple read sequence?


@peri4n could you also define "state-sequence" ... it seems as if it is the occurrence of an ACTG ?


@peri4n lastly the "hidden" is the "what biological reason is this sequence here? "

3 total questions :D ... if you could hit on them all .. that would be awesome!


Very simple answer: your 3 questions are irrelevant. peri4n has quoted a definition of an HMM that exists independently of applications like biological sequence analysis. In a sequence analysis problem the nucleotides or amino acids may correspond to the emission alphabet of each state, not the states themselves.

13.1 years ago
Cjt ▴ 370

I get the feeling that you are having trouble understanding the concept of hidden states. Let's take an example. A plain Markov chain (or Marvok :D) gives you the probability of having sun or rain given that you had sun or rain the day before. For instance, if the observed state yesterday was sun, then p(today|sun) is given by the corresponding transition probabilities.
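As a rough sketch of the plain Markov chain just described, here the weather itself is observed directly, so nothing is hidden. The transition probabilities are hypothetical numbers picked for illustration, and `step` is a made-up helper name.

```python
# Hypothetical transition probabilities P(today | yesterday).
TRANSITION = {
    "sun": {"sun": 0.8, "rain": 0.2},
    "rain": {"sun": 0.4, "rain": 0.6},
}


def step(dist):
    """Advance a probability distribution over the weather by one day."""
    return {
        today: sum(dist[yesterday] * TRANSITION[yesterday][today]
                   for yesterday in TRANSITION)
        for today in TRANSITION
    }


# Yesterday was observed to be sunny, so all probability mass is on "sun".
today = step({"sun": 1.0, "rain": 0.0})
tomorrow = step(today)
print("today   :", today)
print("tomorrow:", tomorrow)
```

With yesterday known to be sunny, the distribution for today is simply the "sun" row of the transition table; looking further ahead just applies the same table again.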

Now imagine that there might be some errors in your observation. Let's assume you use a frog as a forecaster. When he is up the ladder, it is very likely that the sun is shining. But maybe he was just too lazy to climb down? The real weather, sun or rain, cannot be seen. It is hidden and thus has to be modeled as hidden states. To do so, you need to train additional parameters which describe how likely the frog is to stay down or on top when the weather is fine or rainy.
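The frog version can be sketched by adding emission probabilities on top of the weather chain and updating a belief about the unseen weather after each frog sighting. All numbers are hypothetical, and the names `EMISSION`, `normalize` and `filter_weather` are assumptions made for this example.

```python
# Hypothetical numbers, for illustration only.
WEATHER = ("sun", "rain")
TRANSITION = {
    "sun": {"sun": 0.8, "rain": 0.2},
    "rain": {"sun": 0.4, "rain": 0.6},
}
# The additional parameters: how likely the frog is up or down in each weather.
EMISSION = {
    "sun": {"up": 0.9, "down": 0.1},
    "rain": {"up": 0.3, "down": 0.7},
}


def normalize(weights):
    """Scale non-negative weights so that they sum to one."""
    total = sum(weights.values())
    return {k: v / total for k, v in weights.items()}


def filter_weather(frog_positions, prior):
    """Belief about each day's weather given the frog sightings so far."""
    beliefs = []
    belief = dict(prior)
    for day, position in enumerate(frog_positions):
        if day > 0:
            # Predict: let the hidden weather evolve by one day.
            belief = {
                w: sum(belief[v] * TRANSITION[v][w] for v in WEATHER)
                for w in WEATHER
            }
        # Update: reweight by how well each weather explains the frog.
        belief = normalize({w: belief[w] * EMISSION[w][position] for w in WEATHER})
        beliefs.append(belief)
    return beliefs


beliefs = filter_weather(["up", "up", "down"], {"sun": 0.5, "rain": 0.5})
for day, belief in enumerate(beliefs, start=1):
    print(day, belief)
```

The frog never tells you the weather outright; each sighting only shifts the probabilities, which is the practical meaning of the weather being a hidden state.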

In terms of sequence comparison you can imagine HMMs as PSWMs (position-specific weight matrices) with inserts and deletions. To train these HMMs you use several input sequences to determine the likelihoods for each state. When applying your model you then have to calculate, for each observed symbol (nucleotide), the probability that it belongs to a hidden state of the profile (match or mutation) or that it is part of an insert state or a deletion state.


Not too sure about the frog example, but the last paragraph is quite helpful.

13.1 years ago
Delinquentme ▴ 200

I think I've found what I was after:

" This is the distinction between HMMs and a standard Markov model with nothing to hide: in an HMM, the state sequence (e.g. the biologically meaningful alignment) is not uniquely determined by the observed symbol sequence, but must be inferred probabilistically from it. " - (Eddy, 1998) http://bioinformatics.oxfordjournals.org/content/14/9/755.full.pdf

What is "hidden" is the (statistically) most probable nucleotide for that particular spot within that chromosome.

We observe which nucleotide is in that spot. However, it is possible for that particular nucleotide to be a substitution / SNP.

Therefore we need to sequence multiple versions of that specific spot within the genome in order to decipher which is the statistically most prevalent nucleotide for an individual.

The only remaining question is: is "biologically relevant" the same thing as "statistically most prevalent"?


A Markov model is called "hidden" if the state sequence of the generating process is hidden.


@peri4n ... can i get more? this isn't entirely clear ... the "state sequence" being ... " for what biological reason is this nucleotide here? "


@peri4n: exactly, this is the correct answer! @delinquentme: what more do you need? An HMM is a concept in its own right; it doesn't need a biological justification. It has been used in language recognition too, for example.
