Hi,
I'm posting here in the hope that you can help me, since I have very little knowledge of the subject.
I'm working with data on cancer cells and pharmaceuticals. For each cell-compound pair, an IC50 value is given. My goal is to improve predictive algorithms for these values, using cell and compound properties as input.
At this point I'm trying to figure out the best IC50 cutoff values so that I can discretize the data for classification. I know what the values represent, but I'm having trouble deciding which values should serve as thresholds (low and high, good and bad).
The max and min values of my dataset are (in microMolar):
Min IC50: 6.0694369235994496E-9
Max IC50: 4282296.54676787
This question cannot be answered by just looking at min/max values. What is the distribution of the data points? Usually you would expect a lot of inactive values and some actives offset by a few log steps (depending on the screen design, obviously). Also consider what is known about, for example, the targeted receptor (if there is one) and its binding characteristics, the analysis of related screens (same cell lines with different compounds, or related cells with the same compound set), etc. You should definitely look beyond the pure mathematics of your single data set.
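As a rough illustration of inspecting the distribution on a log scale, here is a minimal sketch in plain Python. The function name `log_ic50_histogram`, the choice of equal-width bins, and the example values are all assumptions made for illustration; they are not taken from the dataset in the question.

```python
import math

def log_ic50_histogram(ic50_values, bins=20):
    """Count log10(IC50) values in equal-width bins.

    Returns (counts, edges), where edges has bins + 1 entries.
    The unit of the input does not matter for the shape of the
    histogram; it only shifts the log axis.
    """
    if any(v <= 0 for v in ic50_values):
        raise ValueError("IC50 values must be positive")
    logs = [math.log10(v) for v in ic50_values]
    lo, hi = min(logs), max(logs)
    if hi == lo:
        raise ValueError("need at least two distinct values")
    width = (hi - lo) / bins
    edges = [lo + i * width for i in range(bins + 1)]
    counts = [0] * bins
    for x in logs:
        # The maximum value falls into the last bin instead of overflowing.
        idx = min(int((x - lo) / width), bins - 1)
        counts[idx] += 1
    return counts, edges

# Made-up values purely for illustration (not real measurements).
example = [0.01, 0.5, 2.0, 30.0, 45.0, 60.0, 80.0, 100.0]
counts, edges = log_ic50_histogram(example, bins=4)
for count, left, right in zip(counts, edges[:-1], edges[1:]):
    print(f"{left:6.2f} to {right:6.2f}: {count}")
```

A cluster of inactives separated from a smaller group of actives would show up as two bumps in the counts; a single broad bump suggests that no natural threshold exists in the data alone.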
Express your IC50 values in M rather than µM, as they seem to span a very wide range.
Do you have a confidence interval for the IC50 values? If so, this would be important. It depends on the assay, but usually we think of IC50s in terms of log differences. So a drug with an IC50 of 10^-7 M would be 'better' than a drug with an IC50 of 10^-6 M. If the difference is half a log or less, then we don't consider it to be that dramatic.
The range of the data is worth considering. Do the different drugs have values spanning, say, 5 logs, or just a very small range?
How related are the chemical structures and how are you going to classify these? E.g. addition of a single carbon can change an IC50 significantly.
I don't know if this is sufficient to count as an answer but perhaps it helps a bit.
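The unit conversion and the "log difference" comparison mentioned above can be sketched as follows. This is only an illustration: the function names are hypothetical, the input is assumed to be in micromolar as stated in the question, and the half-log cutoff is the rule of thumb from this answer rather than a universal standard.

```python
import math

def pic50_from_micromolar(ic50_um):
    """Return pIC50 = -log10(IC50 in mol/L), with 1 uM = 1e-6 M."""
    if ic50_um <= 0:
        raise ValueError("IC50 must be positive")
    return -math.log10(ic50_um * 1e-6)

def log_difference(ic50_a, ic50_b):
    """Absolute difference in log10 units (both values in the same unit)."""
    if ic50_a <= 0 or ic50_b <= 0:
        raise ValueError("IC50 must be positive")
    return abs(math.log10(ic50_a) - math.log10(ic50_b))

def notably_different(ic50_a, ic50_b, min_log_gap=0.5):
    """True if the two values differ by more than min_log_gap logs.

    The default of 0.5 is the half-log rule of thumb, an assumption
    that should be adjusted to the assay and its confidence intervals.
    """
    return log_difference(ic50_a, ic50_b) > min_log_gap

# 0.1 uM (1e-7 M) versus 1 uM (1e-6 M): one full log apart.
print(pic50_from_micromolar(0.1), pic50_from_micromolar(1.0))
print(notably_different(0.1, 1.0))
# 1 uM versus 2 uM: about 0.3 logs apart, below the half-log cutoff.
print(notably_different(1.0, 2.0))
```

Working in pIC50 keeps the whole range on a compact scale where higher means more potent, which makes thresholds easier to state.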
I previously made a histogram of the IC50 distribution (in log form), divided into 20 equal intervals: