Since this is a bit more detailed, I'm writing it as a separate answer.
The problem, I think, is this: your frequency values are very large, but the initial values you give the maximizing function are very small. For such parameters, dgamma() underflows and returns plain 0's; log(0) is -Inf, so the negative sum of the log-likelihood is Inf, and the optimizer fails on its very first evaluation. Basically, the true scale parameter of your data is huge, and the initial scale value you chose is far too small for the model to work. But starting from a HUGE scale value, while it will converge nicely, will not give accurate results either. To overcome this, we normalize your data by dividing it by an arbitrarily large number (which you can increase if you find the result doesn't fit well), fit the gamma to the normalized variable, and then recover the mean, shape, and scale of the original data from that fit. Sounds difficult, but it's really easy, and there are only two changes to the code.
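To see the failure concretely, here is a quick illustration with made-up numbers (not your data): far out in the tail, the gamma density underflows to exactly 0, and the negative log-likelihood becomes Inf on the very first evaluation.
# hypothetical values, for illustration only
dgamma(1e6, shape=1, scale=1)                       # 0 -- the density underflows
log(dgamma(1e6, shape=1, scale=1))                  # -Inf
-sum(log(dgamma(c(1e6, 2e6), shape=1, scale=1)))    # Inf -- optimizer fails here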
So, how do we proceed?
First change: Normalize
out <- scan("myfile.txt")
out.b <- out # take a backup
out <- out/10000 # divide out by an arbitrarily high value
Remember, the mean is the expectation of a random variable: E(X) = mean, where X is a random variable (out.b here). Also, E(aX) = a * E(X). So if you fit a gamma to the normalized out, then prod(res$par) * 10000 will be the mean of the original data in out.b. The corresponding parameters of the original data are: shape = shape of the normalized fit, and scale = scale of the normalized fit * 10000. Also, normalizing by a large value helps estimate the scale parameter nicely (at least in what I tried).
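If you want to convince yourself of that back-transformation first, here is a quick sanity check on simulated data (the shape=2, scale=30000 values are made up for illustration): the shape survives the normalization untouched, while the scale and mean shrink by the divisor.
# sanity check with hypothetical parameters
set.seed(1)
x <- rgamma(1e5, shape=2, scale=30000)  # raw data with a large scale
xn <- x/10000                           # normalized copy
mean(xn) * 10000                        # ~ mean(x) = 2 * 30000 = 60000, since E(aX) = a*E(X)
# a gamma fit on xn should return shape ~ 2 and scale ~ 3;
# multiplying that scale by 10000 recovers the original scale ~ 30000
Let's code this now (there are not many changes):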
Same maximizing function (written differently for clarity)
# maximizing function: negative log-likelihood of the gamma fit to `out`
f <- function(shape, scale) {
    val <- dgamma(out, shape=shape, scale=scale)
    -sum(log(val))
}
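As an aside, dgamma() also takes a log argument; computing the density on the log scale avoids taking log() of an underflowed 0 in the first place (the normalization is still what keeps the scale estimate well-behaved). A variant you could try:
# same likelihood, computed on the log scale
f2 <- function(shape, scale) {
    -sum(dgamma(out, shape=shape, scale=scale, log=TRUE))
}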
Second change: set the scale parameter's initial value to the mean of the normalized out variable
start0 <- list("shape"=1, "scale"=mean(out) )
Also, maxit is the maximum number of iterations. It does not depend on the number of data points you have; choose a higher number if convergence is not reached within the runs you initially chose. Here I start with 5000 iterations. If it converges, fine; else I'll increase it and run again.
library(BB)  # spg() comes from the BB package
fcn <- function(x) f(x[1], x[2])
res <- spg(par=as.numeric(start0), fn=fcn, lower=c(1, 1), upper=c(Inf, Inf), control=list(maxit=5000))
res contains the parameters for the normalized out. Get back the parameters of the original data (out.b) from them:
# get back the parameters of the original data
shape_out <- res$par[1]             # shape is unaffected by the normalization
scale_out <- res$par[2] * 10000     # undo the division by 10000
mean_out <- shape_out * scale_out   # mean of a gamma = shape * scale
check_mean_out <- mean(out.b)       # should be close to mean_out
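If you want to eyeball the fit, here is a minimal sketch (assuming out.b still holds your raw values) that overlays the fitted density on a histogram of the original data:
# visual check of the fit
hist(out.b, breaks=100, freq=FALSE, main="gamma fit", xlab="frequency")
curve(dgamma(x, shape=shape_out, scale=scale_out), add=TRUE, col="red")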
This should work. Sorry for the trouble. Let me know if things still don't work out.
Wow, amazing ~ very detailed teaching; I believe it will help many people, not only me ~ Thank you :)
you're welcome :)
Arun, I tried this on my data. I scanned my data into one vector (also named "out") and copied this code from the "f = function(shape, scale) {" line down to the end, but it gives an error! I know little about R; could you please show me how to modify this code for my data? Thanks~
Hi gaouangnow, it would help if you posted your code and the error here.
sure
error message: Error in spg(par = as.numeric(start0), fn = fcn, lower = c(0, 0), upper = c(Inf, : Failure in initial function evaluation!
I guess out contains the k-mer frequencies? I think the problem is that the numbers are too big. It would be better if you scaled out by its mean, or its max, or an arbitrarily large value, along these lines:
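# a sketch -- the divisor is arbitrary; keep a backup of the raw counts
out.b <- out
out <- out/mean(out)   # or out/max(out), or out/1e4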
Now fit the model similarly. If things go well, res$par will contain the fitted shape and scale parameters for the normalized out. I'll try it out from my office and write back. Got to run!
Yes, I have too many counts at the frequency=1 position (sequencing errors or other stuff); maybe I need to filter those out before doing the gamma fit?
Can you link me to your file?