Hello everyone, I have asked this question in Stack Exchange Bioinformatics forum, unfortunately I have not been able to get any help there. So, I am asking it here, maybe some R professionalists can help me to fix the final part of my R-code. I would really appreciate if anyone can help me to resolve the issue.
I am trying to estimate population allele frequency for CNVs (it can only result in genotyping CNV as a dominant marker) using EM algorithm manual in R. Below is the first few lines of my data-set. I have 40,000 genes and for each gene I have 2 rows (the first row is number of individuals that do not have the gene, __a, and the second row is the number of individuals that actually have the gene, __p). The present gene is dominant and can be either homozygote (pp) or heterozygote (pa).
head(data_all)
genes pop_1
1 Ha1_00044004__a 0
2 Ha1_00044004__p 8
3 Ha1_00043269__a 1
4 Ha1_00043269__p 7
5 Ha1_00045379__a 0
6 Ha1_00045379__p 8
I want to estimate the frequency of present genes in my population. I have used the following code, but I am stuck in the last part (I would like to repeat the process unless a condition is met and it is converged).
even_indexes<-seq(2,48216,2) ## draw even rows which are counts for present genes
x.loadings <- data.frame(x=data_all[even_indexes,2])
mu.p <- 0.5
mu.a <- 0.5
## step E (loop through all genes)
out_res <- NULL
for (i in 1:nrow(x.loadings)){
#ABOexpect <- function(pp,mu.a,mu.o){
mu.p <- 0.5
mu.a <- 0.5
npp <- x.loadings[i,] * (mu.p^2/(mu.p^2 + 2*(mu.p*mu.a)))
npa <- x.loadings[i,] * ((2*(mu.p*mu.a)/(mu.p^2 + 2*(mu.a*mu.p))))
N <- data.frame(npp,npa)
out_res<-rbind (out_res,N)
}
### combine the outcome of Estep with present and absent count
data_expect<-cbind(out_res,count_present,count_absent)
colnames(data_expect)<-c("npp","npa","count_present","count_absent")
### step M (loop through all genes)
out_res_m <- NULL
for (i in 1:nrow(data_expect)){
#ABOmaximize <- function(npp,npa,naa) {
n <-8
p <- ((2*data_expect[i,1]) + data_expect[i,2])/ (2 * n)
a <- ((2*data_expect[i,4]) + data_expect[i,2])/ (2 * n)
L <- data.frame(p,a)
out_res_m<-rbind (out_res_m,L)
}
## combine out_res_m with data expect to make a final data set
data_max<-cbind(out_res_m,data_expect)
head(data_mx)
gene_iD. Present_count absent_count app naa p a
Ha1_00044004 8 0 2.666667 5.333333 0.66666667 0.3333333
Ha1_00043269 7 1 2.333333 4.666667 0.58333333 0.4166667
Ha1_00045379 8 0 2.666667 5.333333 0.66666667 0.3333333
.
.
.
The final part that I am stuck is the iteration part, repeat the process until convergence. If I had only one gene I could use something like this :
iter <- 1
diff <- 1
tol <- 0.0001 ## tolerance
while (diff > tol) {
E <- ABOexpect(np,mu.p,mu.a)
M <- ABOmaximize(E$pp,E$pa,np)
diff <- abs(M$mu.p - mu.p) + abs(M$mu.a - mu.a)
mu.a <- M$mu.p
mu.o <- M$mu.a
cat(sprintf("iter:%d, diff:%.2f\n",iter,diff))
iter <- iter + 1
}
cat(sprintf("mu_a=%.4f, mu_o=%.4f\n",mu.a,mu.o))
But my problem is that I have 40,000 genes. I do not know how can I modify the code above to get my desired out put!