Since this is a bit more detailed, I'm writing it as a separate answer.
The problem, I think, is this: your frequency values are very large, but the initial values you give the maximizing function are very small. For such parameters, dgamma() underflows and returns plain 0's; log(0) is -Inf, so the negative sum of the log-likelihood is Inf, and the optimizer fails on its very first evaluation. Basically, the true scale parameter of your data is huge, and the initial scale value you chose is far too small for the model to work. But starting from a HUGE scale value, while it will converge nicely, will not give accurate results either. To overcome this, we normalize your data by dividing it by an arbitrarily large number (which you can increase if you find the result doesn't fit well), fit the gamma to the normalized variable, and then recover the mean, shape, and scale of the original data from that fit. Sounds difficult, but it's really easy, and there are only two changes to the code.
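To see the failure concretely, here is a quick illustration with made-up numbers (not your data): far out in the tail, the gamma density underflows to exactly 0, and the negative log-likelihood becomes Inf on the very first evaluation.
# hypothetical values, for illustration only
dgamma(1e6, shape=1, scale=1)                       # 0 -- the density underflows
log(dgamma(1e6, shape=1, scale=1))                  # -Inf
-sum(log(dgamma(c(1e6, 2e6), shape=1, scale=1)))    # Inf -- optimizer fails here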
So, how do we proceed?
First change: Normalize
out <- scan("myfile.txt")
out.b <- out # take a backup
out <- out/10000 # divide out by an arbitrarily high value
Remember, the mean is the expectation of a random variable: E(X) = mean, where X is a random variable (out.b here). Also, E(aX) = a * E(X). So if you fit a gamma to the normalized out, then prod(res$par) * 10000 will be the mean of the original data in out.b. The corresponding parameters of the original data are: shape = shape of the normalized fit, and scale = scale of the normalized fit * 10000. Also, normalizing by a large value helps estimate the scale parameter nicely (at least in what I tried).
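If you want to convince yourself of that back-transformation first, here is a quick sanity check on simulated data (the shape=2, scale=30000 values are made up for illustration): the shape survives the normalization untouched, while the scale and mean shrink by the divisor.
# sanity check with hypothetical parameters
set.seed(1)
x <- rgamma(1e5, shape=2, scale=30000)  # raw data with a large scale
xn <- x/10000                           # normalized copy
mean(xn) * 10000                        # ~ mean(x) = 2 * 30000 = 60000, since E(aX) = a*E(X)
# a gamma fit on xn should return shape ~ 2 and scale ~ 3;
# multiplying that scale by 10000 recovers the original scale ~ 30000
Let's code this now (there are not many changes):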
Same maximizing function (written differently for clarity)
# maximizing function: negative log-likelihood of the gamma fit to `out`
f <- function(shape, scale) {
    val <- dgamma(out, shape=shape, scale=scale)
    -sum(log(val))
}
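As an aside, dgamma() also takes a log argument; computing the density on the log scale avoids taking log() of an underflowed 0 in the first place (the normalization is still what keeps the scale estimate well-behaved). A variant you could try:
# same likelihood, computed on the log scale
f2 <- function(shape, scale) {
    -sum(dgamma(out, shape=shape, scale=scale, log=TRUE))
}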
Second change: set the scale parameter's initial value to the mean of the normalized out variable
start0 <- list("shape"=1, "scale"=mean(out) )
Also, maxit is the maximum number of iterations. It does not depend on the number of data points you have; choose a higher number if convergence is not reached within the runs you initially chose. Here I start with 5000 iterations. If it converges, fine; else I'll increase it and run again.
library(BB)  # spg() comes from the BB package
fcn <- function(x) f(x[1], x[2])
res <- spg(par=as.numeric(start0), fn=fcn, lower=c(1, 1), upper=c(Inf, Inf), control=list(maxit=5000))
res contains the parameters for the normalized out. Get back the parameters of the original data (out.b) from them:
# get back the parameters of the original data
shape_out <- res$par[1]             # shape is unaffected by the normalization
scale_out <- res$par[2] * 10000     # undo the division by 10000
mean_out <- shape_out * scale_out   # mean of a gamma = shape * scale
check_mean_out <- mean(out.b)       # should be close to mean_out
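If you want to eyeball the fit, here is a minimal sketch (assuming out.b still holds your raw values) that overlays the fitted density on a histogram of the original data:
# visual check of the fit
hist(out.b, breaks=100, freq=FALSE, main="gamma fit", xlab="frequency")
curve(dgamma(x, shape=shape_out, scale=scale_out), add=TRUE, col="red")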
This should work. Sorry for the trouble. Let me know if things still don't work out.
Wow, amazing ~ very detailed teaching; I believe it will help many people, not only me ~ Thank you :)
you're welcome :)
Arun, I tried this on my data. I scanned my data into one vector (also named "out") and copied this code from the "f = function(shape, scale) {" line down to the end, but it gives an error! I know little about R; could you please show me how to modify this code for my data? Thanks~
Hi gaouangnow, it would help if you posted your code and the error here.
sure
error message: Error in spg(par = as.numeric(start0), fn = fcn, lower = c(0, 0), upper = c(Inf, : Failure in initial function evaluation!
I guess out contains the k-mer frequencies? I think the problem is that the numbers are too big. It would be better if you scaled out by its mean, or its max, or an arbitrarily large value, along these lines:
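# a sketch -- the divisor is arbitrary; keep a backup of the raw counts
out.b <- out
out <- out/mean(out)   # or out/max(out), or out/1e4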
Now fit the model similarly. If things go well, res$par will contain the fitted shape and scale parameters for the normalized out. I'll try it out from my office and write back. Got to run!
Yes, I have too many counts at the frequency=1 position (sequencing errors or other stuff); maybe I need to filter those out before doing the gamma fit?
Can you link me to your file?