I have built a database of genes, from different organisms, that my ChIP-seq data analysis identifies as regulated by a specific protein complex. I also have a positive control for each of these organisms (genes that have been shown to be regulated by the complex) and a negative control for each organism (genes that, based on a condition, have nothing to do with the complex). I want to test the reliability of my analysis by computing the four fundamental counts (TP, FP, TN, FN) and the derived rates (false positive rate, etc.).
With my limited knowledge of test statistics, I came up with something like this (please check the comments in my code to follow it):
# Make an empty matrix: one row per sample, one column per count
mat2 <- matrix(NA, nrow = nrow(samples.annotationsnew), ncol = 4,
               dimnames = list(samples.annotationsnew$SampleNo, c("TP", "FP", "TN", "FN")))
# Intersection between my database of genes and the positive control -> True Positive
# The rest of the positive control will be the False Negative
for (i in seq_len(nrow(samples.annotationsnew))) {
  for (j in seq_along(pos_ctrls)) {
    # Match the sample's organism to the control set of the same organism
    if (samples.annotationsnew$ensembl.org[i] == names(pos_ctrls)[j]) {
      hits <- length(intersect(genes2peaksnew[[i]]$feature, pos_ctrls[[j]][, 3]))
      mat2[i, "TP"] <- hits
      mat2[i, "FN"] <- length(pos_ctrls[[j]][, 3]) - hits
    }
  }
}
# Intersection between my database of genes and the negative control -> False Positive
# The rest of the negative control will be the True Negative
for (i in seq_len(nrow(samples.annotationsnew))) {
  for (j in seq_along(neg_ctrls)) {
    if (samples.annotationsnew$ensembl.org[i] == names(neg_ctrls)[j]) {
      hits <- length(intersect(genes2peaksnew[[i]]$feature, neg_ctrls[[j]][, 9]))
      mat2[i, "FP"] <- hits
      mat2[i, "TN"] <- length(neg_ctrls[[j]][, 9]) - hits
    }
  }
}
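For the rates, here is a minimal sketch, assuming mat2 has been filled with the four counts as in the code above. The numbers below are made up purely for illustration, and the sample names S1 and S2 are hypothetical.

```r
# Toy counts standing in for mat2 (values are invented for illustration only)
mat2 <- matrix(c(40, 5, 95, 10,
                 30, 8, 72, 20),
               nrow = 2, byrow = TRUE,
               dimnames = list(c("S1", "S2"), c("TP", "FP", "TN", "FN")))

# Per-sample rates derived from the four counts
rates <- cbind(
  sensitivity = mat2[, "TP"] / (mat2[, "TP"] + mat2[, "FN"]),  # true positive rate
  specificity = mat2[, "TN"] / (mat2[, "TN"] + mat2[, "FP"]),  # true negative rate
  FPR         = mat2[, "FP"] / (mat2[, "FP"] + mat2[, "TN"]),  # false positive rate
  precision   = mat2[, "TP"] / (mat2[, "TP"] + mat2[, "FP"])   # positive predictive value
)
print(rates)
```

Note that FPR is simply 1 - specificity, so only one of the two needs to be reported.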
Is what I am doing realistic, or does it have nothing to do with true/false positive/negative condition testing?
Thanks in advance
Thanks for your quick answer,
Could the neg_exp dataset (genes not regulated by the complex according to my experiment) be the genes that I discarded by filtering during the ChIP-seq data analysis?
Coming up with negatives is always difficult. Are the genes you are referring to the ones that, as you state, "based on a condition have nothing to do with the complex"? If so, it should be fine. The most important part is always writing up exactly what you consider your positive and negative sets.
Correct, neg_exp will have to be the genes that were discarded or that have no peaks in your dataset. Note that your control dataset really needs to match the ChIP-seq experimental conditions as closely as possible. Any biological difference between the two would make the TP/TN/FP/FN metrics meaningless.
Yes, your point about the control dataset is totally correct! But the genes "without peaks" would be all the rest of the organism's genes, which I also find meaningless, so I will stick with the discarded ones! Thanks again.
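To make the discarded-genes choice concrete, here is a rough sketch of how the four counts could be taken for one organism once neg_exp is restricted to the discarded genes. The name pos_exp (genes kept after filtering) and all gene IDs are hypothetical, chosen only for illustration; this is one possible reading of the discussion above, not a definitive recipe.

```r
# Hypothetical gene IDs for a single organism, for illustration only
pos_exp  <- c("g1", "g2", "g3", "g7")  # genes kept after ChIP-seq filtering (assumed name)
neg_exp  <- c("g4", "g5", "g6", "g8")  # genes discarded by the filtering
pos_ctrl <- c("g1", "g2", "g4")        # positive control: known targets of the complex
neg_ctrl <- c("g3", "g5", "g6")        # negative control: genes unrelated to the complex

# Each count is the overlap of one experimental set with one control set
counts <- c(
  TP = length(intersect(pos_exp, pos_ctrl)),
  FN = length(intersect(neg_exp, pos_ctrl)),
  FP = length(intersect(pos_exp, neg_ctrl)),
  TN = length(intersect(neg_exp, neg_ctrl))
)
print(counts)
```

With this definition, control genes that appear in neither pos_exp nor neg_exp are left out of the counts, whereas subtracting the overlap from the full control size counts them as FN or TN. Whichever convention is used should be stated explicitly in the write-up.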