Hi everyone,
I am working on copy number analysis and want to apply an HMM to my data.
Say I have data for 1 individual with ~60k windows. For each window I know whether the copy number is a gain, a loss, or normal. E.g.:
chr1 0 100 Loss
chr1 500 600 Loss
chr1 600 700 Gain
What I want to do-
I want to find out whether the observed state of any window is due to error, so I want to infer the true state of each window based on the previous states. E.g., say I have 10 windows with the following copy numbers:
Loss
Loss
Loss
Loss
Normal
Loss
Loss
Loss
Loss
Loss
In the above example, we can say that the copy number in the 5th window (Normal) is probably due to some error, so we can set the true state of the 5th window to Loss. (This is only one simple example; there will be many other cases where we cannot decide just by looking.)
What I have understood-
I can define my 3 states as - Gain, Loss and Normal.
Then I can randomly assign the state transition probabilities and the observation (emission) probabilities.
Then apply the Baum-Welch algorithm to fit the parameters (to re-estimate my random probabilities based on the sequence of states in my 60k windows).
Then apply the Viterbi algorithm to get the true states.
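To make the last step concrete, here is a minimal NumPy sketch of Viterbi decoding applied to the 10-window example above. The transition and emission probabilities are hand-picked assumptions for illustration only (98% chance of staying in the same state, 90% chance that the observed call matches the true state); in the plan above they would come from Baum-Welch instead. The function and variable names are hypothetical.

```python
import numpy as np

STATES = ["Loss", "Normal", "Gain"]

def viterbi(obs, start_p, trans_p, emit_p):
    """Most likely hidden-state path for a sequence of observed state indices."""
    n_states = trans_p.shape[0]
    T = len(obs)
    # Work in log space so long sequences (e.g. 60k windows) do not underflow.
    log_start, log_trans, log_emit = np.log(start_p), np.log(trans_p), np.log(emit_p)
    score = np.empty((T, n_states))
    back = np.zeros((T, n_states), dtype=int)
    score[0] = log_start + log_emit[:, obs[0]]
    for t in range(1, T):
        # cand[i, j]: best score of being in state i at t-1, then moving to j
        cand = score[t - 1][:, None] + log_trans
        back[t] = cand.argmax(axis=0)
        score[t] = cand.max(axis=0) + log_emit[:, obs[t]]
    # Trace back from the best final state.
    path = np.empty(T, dtype=int)
    path[-1] = score[-1].argmax()
    for t in range(T - 1, 0, -1):
        path[t - 1] = back[t, path[t]]
    return path

# Assumed parameters (not fitted): "sticky" transitions, mostly correct calls.
start_p = np.full(3, 1 / 3)
trans_p = np.full((3, 3), 0.01)
np.fill_diagonal(trans_p, 0.98)
emit_p = np.full((3, 3), 0.05)
np.fill_diagonal(emit_p, 0.90)

calls = ["Loss"] * 4 + ["Normal"] + ["Loss"] * 5
obs = np.array([STATES.index(c) for c in calls])
decoded = [STATES[i] for i in viterbi(obs, start_p, trans_p, emit_p)]
print(decoded)
```

With these assumed numbers, two state switches cost more than one miscalled window, so the isolated Normal in window 5 is decoded as Loss. How aggressively isolated calls get smoothed depends entirely on the transition and emission probabilities.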
Questions-
Do you think it is appropriate to apply an HMM to my data, or have I misunderstood everything and it is not a good idea?
If an HMM will work, can somebody tell me whether I need to change something in my aforementioned steps?
Although the Baum-Welch algorithm will be used for fitting, I have a bad feeling about assigning the probabilities randomly in the beginning.
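One possible alternative to purely random starting values, sketched below, is to seed the transition matrix from the transitions counted in the observed calls and to seed the emission matrix from a guessed error rate. This is only an assumption-laden starting point for Baum-Welch, not a fitted model: the function names, the pseudocount, and the 10% error rate are all hypothetical choices. Since Baum-Welch only converges to a local optimum, it is also common to try several starting points and keep the fit with the highest likelihood.

```python
import numpy as np

def initial_transitions(obs, n_states=3, pseudocount=1.0):
    """Row-normalised transition counts from the observed call sequence."""
    # The pseudocount keeps unseen transitions from getting probability zero.
    counts = np.full((n_states, n_states), pseudocount)
    for prev, curr in zip(obs[:-1], obs[1:]):
        counts[prev, curr] += 1
    return counts / counts.sum(axis=1, keepdims=True)

def initial_emissions(n_states=3, error_rate=0.1):
    """Emission matrix assuming calls are correct with prob. 1 - error_rate."""
    emit = np.full((n_states, n_states), error_rate / (n_states - 1))
    np.fill_diagonal(emit, 1.0 - error_rate)
    return emit

# 0 = Loss, 1 = Normal, 2 = Gain (the 10-window example from the question)
example_obs = [0, 0, 0, 0, 1, 0, 0, 0, 0, 0]
init_trans = initial_transitions(example_obs)
init_emit = initial_emissions()
print(init_trans)
print(init_emit)
```

Because the observed calls contain the very errors being corrected, these counts slightly overestimate how often states really change, which is why they are treated here as initial values rather than final parameters.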
P.S.: Please let me know if I should post this question on Stats Stack Exchange instead; I thought it made more sense to post it here (biological data + algorithms).
Thanks in advance,
Vikas
EDIT: If you think this problem can be solved by using some other algo or procedure, please let me know.
Somehow my comment never posted this morning. Anyway, I was just wondering: can't you just use a simple machine learning classifier to classify your dataset into those 3 classes?
Can you please elaborate? It would be great if you could post a small example as an answer showing how you would deal with this problem. (Please see my Edit.)
Sure. This paper should give you a brief overview and a good introduction http://www.informatica.si/PDF/31-3/11_Kotsiantis%20-%20Supervised%20Machine%20Learning%20-%20A%20Review%20of...pdf