Question

Hidden Markov Model On Copy Number Data

12.4 years ago

Vikas Bansal ★ 2.4k

Hi everyone,

I am working on copy number analysis and want to apply an HMM to my data.

Say I have data for 1 individual with ~60k windows. For each window I know whether it shows a gain, a loss, or normal copy number. E.g.:

chr1       0           100         Loss
chr1       500         600         Loss
chr1       600         700         Gain

What I want to do-

I want to find out whether the observed state of any window is due to errors, so I want to infer the true state of each window based on the previous states. E.g., say I have 10 windows with the following copy numbers:

Loss
Loss
Loss
Loss
Normal
Loss
Loss
Loss
Loss
Loss

In the above example, we can say that the copy number in the 5th window (Normal) is probably due to some error, so we can set the true state of the 5th window to Loss. (This is only one simple example; there will be many other cases where we cannot decide just by looking.)

What I have understood-

I can define my 3 states as Gain, Loss and Normal.
Then I can randomly assign the state transition probabilities and the observation probabilities.
Then apply the Baum-Welch algorithm to fit the parameters (to adjust my random probabilities based on the sequence of states in my 60k windows).
Then apply the Viterbi algorithm to get the true states.
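
As a rough illustration of the last step above (not code from the original post), here is a minimal Viterbi sketch in Python with NumPy, applied to the 10-window example. The parameter values (a 0.98 probability of staying in the same state, a 0.90 probability that the observed call matches the true state) are assumptions picked by hand for the illustration, not fitted values.

```python
import numpy as np

STATES = ["Gain", "Loss", "Normal"]

def viterbi(obs, pi, A, B):
    """Return the most likely hidden state path (computed in log space)."""
    T, K = len(obs), len(pi)
    with np.errstate(divide="ignore"):       # log(0) = -inf is acceptable here
        log_pi, log_A, log_B = np.log(pi), np.log(A), np.log(B)
    delta = np.zeros((T, K))                 # best log-probability ending in each state
    back = np.zeros((T, K), dtype=int)       # best predecessor of each state
    delta[0] = log_pi + log_B[:, obs[0]]
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_A   # scores[i, j]: move from i to j
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_B[:, obs[t]]
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):            # trace the best path backwards
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Hand-picked (assumed) parameters: "sticky" states, mostly correct calls.
pi = np.full(3, 1 / 3)
A = np.full((3, 3), 0.01) + np.eye(3) * 0.97   # 0.98 on the diagonal, 0.01 elsewhere
B = np.full((3, 3), 0.05) + np.eye(3) * 0.85   # 0.90 on the diagonal, 0.05 elsewhere

calls = ["Loss"] * 4 + ["Normal"] + ["Loss"] * 5
obs = [STATES.index(c) for c in calls]
decoded = [STATES[k] for k in viterbi(obs, pi, A, B)]
print(decoded)
```

With these assumed parameters the isolated Normal call in the 5th window is decoded as Loss, because two state changes cost more than one miscalled window.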

Questions-

Do you think it is appropriate to apply an HMM to my data, or have I misunderstood everything and it is not a good idea?

If an HMM will work, can somebody tell me whether I need to change anything in the steps above?

Although Baum-Welch will be used for fitting, I have a bad feeling about assigning the probabilities randomly at the beginning. Is that okay?

P.S: Please let me know if I should post this question on stats stack exchange but I thought it makes more sense to post it here (Biological data + Algorithms).

Thanks in advance,

Vikas

EDIT: If you think this problem can be solved by using some other algo or procedure, please let me know.

hmm cnv • 4.5k views

ADD COMMENT • link updated 12.4 years ago by Qdjm 1.9k • written 12.4 years ago by Vikas Bansal ★ 2.4k

Somehow my comment never posted this morning. Anyway, I was just wondering: can't you just use a simple machine learning classifier to classify your dataset into those 3 classes?

ADD REPLY • link 12.4 years ago by Gjain 5.8k

Can you please elaborate? It would be great if you could post a small example as an answer showing how you would deal with this problem. (Please see my edit.)

ADD REPLY • link 12.4 years ago by Vikas Bansal ★ 2.4k

Sure. This paper should give you a brief overview and a good introduction http://www.informatica.si/PDF/31-3/11_Kotsiantis%20-%20Supervised%20Machine%20Learning%20-%20A%20Review%20of...pdf

ADD REPLY • link 12.4 years ago by Gjain 5.8k

score 2 · Answer 1 · 2012-06-22

12.4 years ago

Qdjm 1.9k

If your input is already Gain, Loss and Normal then it's not clear how much more useful an HMM would be over a simple heuristic.
1. Consider using maximum marginal probability (MMP), also known as posterior decoding, to infer Gain, Loss or Normal at each position rather than Viterbi. Viterbi gives you the most likely path through the states; MMP tells you the most likely state at each position.
2. In general, starting with random parameters is fine. However, in your case, I presume that you want each state to correspond to one of Gain, Loss or Normal. If you initialize randomly, there's no guarantee that this will happen, because Baum-Welch can get stuck in local optima. So I recommend initializing the parameters to point the HMM toward the answer you expect, by setting them based on what you think the final parameter values will be, e.g. initializing the "Gain" state to have a high probability of outputting "Gain" and a small probability of outputting the other states. Baum-Welch will refine your initial settings to make them a better match to the data. However, be careful about assigning zero probabilities, because Baum-Welch will keep any such probability equal to zero. Of course, if you think that the zero is appropriate, then use it.
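
To make point 2 concrete, here is a minimal Baum-Welch sketch in Python with NumPy (an illustration, not code from this answer). Everything specific in it is an assumption: the toy data are simulated from a made-up "true" model, the initial values are hand-picked guesses, and the zero on the Gain-to-Loss transition is there only to show that Baum-Welch leaves it at zero.

```python
import numpy as np

def forward_backward(obs, pi, A, B):
    """Scaled forward-backward: state posteriors, pair posteriors, log-likelihood."""
    T, K = len(obs), len(pi)
    alpha, beta, c = np.zeros((T, K)), np.ones((T, K)), np.zeros(T)
    alpha[0] = pi * B[:, obs[0]]
    c[0] = alpha[0].sum()
    alpha[0] /= c[0]
    for t in range(1, T):
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
        c[t] = alpha[t].sum()
        alpha[t] /= c[t]
    for t in range(T - 2, -1, -1):
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1]) / c[t + 1]
    gamma = alpha * beta                          # P(state at t | all observations)
    xi = (alpha[:-1, :, None] * A[None]           # P(state i at t, state j at t+1 | all obs)
          * (B[:, obs[1:]].T * beta[1:])[:, None, :] / c[1:, None, None])
    return gamma, xi, np.log(c).sum()

def baum_welch(obs, pi, A, B, n_iter=25):
    """Re-estimate pi, A, B from expected counts for a fixed number of iterations."""
    for _ in range(n_iter):
        gamma, xi, _ = forward_backward(obs, pi, A, B)
        pi = gamma[0]
        A = xi.sum(axis=0)
        A /= A.sum(axis=1, keepdims=True)
        B = np.stack([gamma[obs == v].sum(axis=0) for v in range(B.shape[1])], axis=1)
        B /= B.sum(axis=1, keepdims=True)
    return pi, A, B

# Simulate toy calls (0 = Gain, 1 = Loss, 2 = Normal) from an assumed "true" model.
rng = np.random.default_rng(0)
A_true = np.array([[0.95, 0.00, 0.05],
                   [0.00, 0.95, 0.05],
                   [0.025, 0.025, 0.95]])
B_true = np.full((3, 3), 0.05) + np.eye(3) * 0.85
state, obs = 2, []
for _ in range(2000):
    obs.append(rng.choice(3, p=B_true[state]))
    state = rng.choice(3, p=A_true[state])
obs = np.array(obs)

# Informed initialization: each state mostly emits its own label.
pi0 = np.full(3, 1 / 3)
A0 = np.array([[0.90, 0.00, 0.10],    # Gain -> Loss deliberately set to zero
               [0.05, 0.90, 0.05],
               [0.05, 0.05, 0.90]])
B0 = np.full((3, 3), 0.10) + np.eye(3) * 0.70

pi_fit, A_fit, B_fit = baum_welch(obs, pi0, A0, B0)
print(np.round(A_fit, 3))
print(np.round(B_fit, 3))
```

The entry `A_fit[0, 1]` stays exactly zero, because every expected count for that transition is multiplied by the zero in `A0`. Entries that start small but nonzero can still be moved by the data.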

ADD COMMENT • link 12.4 years ago by Qdjm 1.9k

Hi, thanks for your reply. I have some questions. "If your input is already Gain, Loss and Normal then it's not clear how much more useful an HMM would be over a simple heuristic": can you please explain a little why it is not clear?

Can you please provide some good references for "MMP" (ideally including a comparison with Viterbi)?

ADD REPLY • link 12.4 years ago by Vikas Bansal ★ 2.4k

If your observed data already clearly defines the predicted state, then there are no hidden states to learn, so an HMM-based smoothing of your data is going to be roughly equivalent to a simple rule like "don't change state unless you see two observations of the new state in a row", depending on how often state changes occur.
You can calculate the marginal probability distribution of the hidden state at each position using forward-backward. It's called "smoothing" in the HMM Wikipedia article. Viterbi computes "the most likely explanation", which is described in the next paragraph in that article.
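
The two ideas in this reply can be sketched side by side in Python with NumPy. This is an illustration under assumptions, not the commenter's code: `two_in_a_row` is one possible reading of the quoted rule, and the HMM parameters are hand-picked values (0.98 to stay in a state, 0.90 for a correct call), not fitted ones.

```python
import numpy as np

STATES = ["Gain", "Loss", "Normal"]

def two_in_a_row(calls):
    """Keep the current state until a new state is observed twice in a row."""
    smoothed, current = [], calls[0]
    for i, call in enumerate(calls):
        if call != current and i + 1 < len(calls) and calls[i + 1] == call:
            current = call
        smoothed.append(current)
    return smoothed

def posterior_decode(obs, pi, A, B):
    """Forward-backward smoothing: P(state at t | all observations) for each t."""
    T, K = len(obs), len(pi)
    alpha, beta = np.zeros((T, K)), np.ones((T, K))
    alpha[0] = pi * B[:, obs[0]]
    alpha[0] /= alpha[0].sum()
    for t in range(1, T):                     # forward pass, rescaled at each step
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
        alpha[t] /= alpha[t].sum()
    for t in range(T - 2, -1, -1):            # backward pass, rescaled at each step
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
        beta[t] /= beta[t].sum()
    gamma = alpha * beta
    return gamma / gamma.sum(axis=1, keepdims=True)

# Hand-picked (assumed) parameters.
pi = np.full(3, 1 / 3)
A = np.full((3, 3), 0.01) + np.eye(3) * 0.97   # 0.98 on the diagonal
B = np.full((3, 3), 0.05) + np.eye(3) * 0.85   # 0.90 on the diagonal

calls = ["Loss"] * 4 + ["Normal"] + ["Loss"] * 5
obs = [STATES.index(c) for c in calls]

gamma = posterior_decode(obs, pi, A, B)
mmp_calls = [STATES[k] for k in gamma.argmax(axis=1)]
rule_calls = two_in_a_row(calls)
print(rule_calls)
print(mmp_calls)
print(np.round(gamma[4], 4))   # posterior over Gain/Loss/Normal for the 5th window
```

On the 10-window example from the question both approaches relabel the 5th window as Loss. The smoothing output additionally gives a per-window probability, which the simple rule does not.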

ADD REPLY • link 12.4 years ago by Qdjm 1.9k

Thanks for your reply. I will read about this.

ADD REPLY • link 12.4 years ago by Vikas Bansal ★ 2.4k

Hi! I read about smoothing and now understand the difference between the output of Viterbi and smoothing. I still have some confusion, but I think that question is more suitable for Stats Stack Exchange. Just a small question: if I use smoothing, should I run Baum-Welch for fitting first?

ADD REPLY • link 12.4 years ago by Vikas Bansal ★ 2.4k

Yes. Smoothing is just a different way of deciding on a hidden state sequence.

ADD REPLY • link 12.4 years ago by Qdjm 1.9k

Thanks. I posted a question related to this on Stats Stack Exchange here. From the answer: "It is generally not possible to just paste together the most probable states from the marginal conditional distributions to a sequence and claim that the resulting sequence has merits as a sequence." Any comments?

ADD REPLY • link 12.4 years ago by Vikas Bansal ★ 2.4k

See the last sentence: "I would recommend, if possible, to avoid the hard imputation and work with the conditional distribution of states given emissions as provided by the model. For instance through simulations." The recommendation of NRH is the same as mine. You do want to be careful if the transition matrix has zero probabilities. "Smoothing" only tells you the most likely state at each point, which I assume is closer to what you want than the sequence of hidden states. I think that you've got enough feedback from us -- do a bit of work on your own and figure it out for yourself, that's what grad school is for.
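
The caution about zero transition probabilities can be seen in a deliberately artificial toy example, sketched here in Python with NumPy. The numbers are invented for the illustration, and the emissions are assumed to be uninformative so that the posterior over paths is simply the Markov chain itself. All two-step paths are enumerated by brute force so that nothing is hidden inside an algorithm.

```python
from itertools import product

import numpy as np

# Toy 3-state chain (assumed values): state 0 can never follow itself.
pi = np.array([0.4, 0.3, 0.3])
A = np.array([[0.0, 0.5, 0.5],
              [1.0, 0.0, 0.0],
              [1.0, 0.0, 0.0]])

# Probability of every possible 2-step path (emissions assumed uninformative).
paths = list(product(range(3), repeat=2))
prob = {p: float(pi[p[0]] * A[p[0], p[1]]) for p in paths}

# Viterbi-style answer: the single most probable path.
best_path = max(prob, key=prob.get)

# Smoothing-style answer: the most probable state at each position separately.
marginals = np.zeros((2, 3))
for path, weight in prob.items():
    for t, state in enumerate(path):
        marginals[t, state] += weight
pointwise = tuple(int(k) for k in marginals.argmax(axis=1))

print("most probable path:", best_path, "probability", prob[best_path])
print("pointwise most probable states:", pointwise, "probability", prob[pointwise])
```

Here the pointwise choice is state 0 at both positions, yet the path (0, 0) has probability zero because `A[0, 0]` is zero. The per-position states are each individually the most likely, but pasted together they form a sequence the model cannot produce, which is the point made in the quoted Stats Stack Exchange answer.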