Hi everybody.
As a newbie to bioinformatics, I often find it difficult to see how biological knowledge mixes with statistics. I come from the Machine Learning field, and I usually have problems with the naming conventions (well, among several other things, I must admit). Besides, I am not an expert in statistics, having used only the bare minimum needed to validate my work.
Well, let's try to be more precise. One of the topics I am working on most right now is the analysis of methylation array data. As you surely know, the final processed (and normalized) beta values are presented in a p x n matrix, where there are p different probes and n different samples or individuals from which we have obtained the beta values. I am not currently working with the raw data.
Imagine, for a moment, that we have identified two regions of probes, A and B, with a group of nA probes belonging to A, another group (of nB probes) that belongs to B, and the intersection is empty. Say that we want to find a way to show there is a statistically significant difference between the methylation values of both regions.
As far as I have seen in the literature, statistical tests always compare the values of the same probe between case and control groups of individuals or samples, for example when we are trying to find differentially methylated probes.
However, if I think of directly comparing all the beta values from region A (nA * n values) against the ones in region B (nB * n values) with, say, a t-test, I get the suspicion that something is not being done the way it should. My knowledge of Biology and Statistics is still limited and I cannot explain why, but I have the feeling that there is something formally wrong with this approach. Am I right?
What I have done in similar experiments is to find differentially methylated probes and then test the proportion of those probes against the total number of probes, so that I could assign a p-value showing a significant influence of the region of reference.
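One possible reading of this proportion test is an enrichment test: given how many probes are differentially methylated overall, is the count inside the region larger than chance would give? The sketch below assumes a one-sided hypergeometric test, which is my assumption and not necessarily the test used above; the function name and all counts are made up for illustration.

```python
from math import comb

def hypergeom_upper_tail(k, n, K, N):
    """P(X >= k) when n probes are drawn without replacement from N probes,
    K of which are differentially methylated."""
    total = comb(N, n)
    upper = min(n, K)
    # comb() returns 0 when the lower argument exceeds the upper one,
    # so impossible configurations contribute nothing to the sum.
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / total

# Hypothetical counts: 1000 probes on the array, 100 of them differential,
# and a region of 50 probes that contains 15 differential ones.
p_value = hypergeom_upper_tail(k=15, n=50, K=100, N=1000)
print(p_value)
```

Note that dichotomizing probes into "differential" and "not differential" throws away the size of the effect, so a test on the proportions is expected to be less sensitive than a test on the beta values themselves.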
Several questions here: what would be a coherent approach to the regions A and B problem stated above? Is there any problem with methylation data I am not aware of that makes only the within-probe analysis valid? Are there any bibliographic references that could help me see the subtleties involved?
As you can see, the concepts are quite tangled in my mind, so any help would be much appreciated.
Regards, Gustavo
Hi Leonor.
First of all, thank you for your kind reply. I think the main problem I am having is that it is difficult for me to put into words what I really intend to say. That's why I like places like Biostars or StackOverflow, because they let me try to define these thought problems through written dialogue. This is to say that I am not uncomfortable with the t-test. Actually, I think that, since I jumped from ML, we (the test and I) have developed a good and respectful relationship. ;)
(Let's get to the problem, Gus.) Well, if I am trying to see whether two samples of beta values coming from the same probe are significantly different, I do not have any conceptual problem, since I think of the beta values from a single probe as a marginal distribution of the general, multivariate and unknown one from which we are sampling our data. In that case, I am making inferences between subsamples of the same sample, both of them obtained according to a given criterion (for example, the typical classification problem between control and cancer samples). Informally speaking, I think of this as "comparing by rows".
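To make "comparing by rows" concrete, here is a minimal sketch that computes Welch's t statistic for each probe between case and control samples. The matrix `betas`, the labels `is_case` and the helper `welch_t` are hypothetical names with made-up numbers; only the statistic is computed, and getting a p-value would need the t distribution (or a permutation scheme) on top.

```python
from math import sqrt
from statistics import mean, variance

def welch_t(x, y):
    """Welch's t statistic for two independent samples (unequal variances)."""
    standard_error = sqrt(variance(x) / len(x) + variance(y) / len(y))
    return (mean(x) - mean(y)) / standard_error

# Toy beta-value matrix: one row per probe, one column per sample.
betas = {
    "probe1": [0.10, 0.12, 0.11, 0.45, 0.50, 0.48],
    "probe2": [0.80, 0.82, 0.79, 0.81, 0.80, 0.83],
}
# Hypothetical labels: first three samples are controls, last three are cases.
is_case = [False, False, False, True, True, True]

t_by_probe = {}
for probe, row in betas.items():
    case = [b for b, c in zip(row, is_case) if c]
    control = [b for b, c in zip(row, is_case) if not c]
    t_by_probe[probe] = welch_t(case, control)
    print(probe, round(t_by_probe[probe], 2))
```

Each test here stays within one row of the matrix, so every comparison is between two groups of individuals measured on the same probe.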
I do have problems, however, when, as I stated above, I have regions defined over probes. I think that is because of my view of probes as marginal distributions. Imagine that I have different measurements of a body: arm length, leg length, etc. For me, these are the equivalents of probes, so I have no problem comparing arm lengths with each other, but I do have one when I think about comparing arm lengths with leg lengths. More informally, "comparing by columns" seems strange to me.
If I understood you correctly, you are telling me that, given that the regions share no probes, we could consider them independent, even if they comprise values coming from the same individuals. Can we do that? That is the most difficult point for me to understand because, as the columns of beta values in regions A and B stand for paired individuals (for each individual there is a column of data in both region A and region B), I really have difficulty considering them independent.
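One way to respect the pairing described here, offered as a sketch and not as the approach suggested in the reply, is to collapse each region to one mean per individual and test the per-individual differences with a sign-flip permutation test. All names (`region_means`, `sign_flip_pvalue`, the probe labels) and all numbers are hypothetical.

```python
import random
from statistics import mean

def region_means(betas, probes):
    """Mean beta value over the given probes, one value per individual."""
    n_samples = len(betas[probes[0]])
    return [mean([betas[p][j] for p in probes]) for j in range(n_samples)]

def sign_flip_pvalue(diffs, n_perm=5000, seed=0):
    """Two-sided paired permutation test: under the null hypothesis the sign
    of each per-individual difference is arbitrary, so flip signs at random
    and count how often the mean is at least as extreme as the observed one."""
    rng = random.Random(seed)
    observed = abs(mean(diffs))
    hits = 0
    for _ in range(n_perm):
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(mean(flipped)) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)

# Toy data: 6 individuals (columns), region A has 2 probes, region B has 3.
betas = {
    "a1": [0.20, 0.25, 0.22, 0.30, 0.27, 0.24],
    "a2": [0.22, 0.27, 0.20, 0.33, 0.25, 0.26],
    "b1": [0.40, 0.47, 0.41, 0.52, 0.46, 0.44],
    "b2": [0.42, 0.45, 0.43, 0.50, 0.48, 0.45],
    "b3": [0.38, 0.44, 0.40, 0.51, 0.45, 0.46],
}
mean_a = region_means(betas, ["a1", "a2"])
mean_b = region_means(betas, ["b1", "b2", "b3"])
diffs = [a - b for a, b in zip(mean_a, mean_b)]
p_value = sign_flip_pvalue(diffs)
print(p_value)
```

With this layout the unit of analysis is the individual, so the sample size is n and not nA * n or nB * n. Averaging over probes does discard the within-region variability, which may or may not matter for a given question.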
Your point of view on the power of the proportion test (the last paragraph) was very inspiring. I had not thought about it that way, and now I think you are completely right. :)
This is quite a long comment... But let's try to stay focused on the question.
You say that comparing probes from different regions would seem like comparing arm lengths and leg lengths. I do not agree, and this is actually one of the bases of µarray analyses (comparing different probes to get the most DE ones and then working on these regions). From the conceptual point of view, arm lengths and leg lengths are not homogeneous (legs are always longer than arms, independently of most biological factors). However, probes are homogeneous: there is no reason (outside the biological effect you are looking for) to expect probeX to have a larger value than probeY.
The other point you raise concerns the independence between probes, given that they all come from the same individuals. Again, I think you are over-complicating the problem and not raising the right questions. There would be dependence if region A were a repetition of region B, or if they had some overlap. This is not the case here.
Thank you again, Leonor. I am sorry for the length of my previous comment, but I guess that is just inversely proportional to my knowledge about the problem. ;)
I think homogeneity is the key for me to understand it. Correct me if I'm wrong: it is not something general, it depends on the actual problem and, for this one, our variables are homogeneous, so we can compare them. Am I right?
With respect to the other point, I think I understand your point of view, but I am less sure than about the previous one. What if some probes in region B are correlated with some in region A? Could we then be talking about dependent variables?
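A quick way to look at this on real data would be the Pearson correlation between the per-individual means of the two regions. The sketch below uses made-up per-individual means and a hand-written `pearson` helper (both hypothetical). The number only says whether the regions move together across individuals; it does not say why.

```python
from math import sqrt
from statistics import mean

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Hypothetical per-individual mean beta values for regions A and B.
mean_a = [0.21, 0.26, 0.21, 0.315, 0.26, 0.25]
mean_b = [0.40, 0.453, 0.413, 0.51, 0.463, 0.45]

r = pearson(mean_a, mean_b)
print(round(r, 3))
```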
By the way, I am going to print your previous comment and put it on the wall just in front of me. It was very clear and precise. Thank you. I really mean it. :)
homogeneity: this always depends on your problem, but the main question you should ask yourself is: are X and Y comparable or are there intrinsic differences other than the biological effect I'm trying to detect?
independence: again, here, are A and B correlated by some factor that has no biological meaning (repetition, overlap, ...) or are they correlated by the biological effect you are working on (the same transcription factor affecting their expression, or something else for methylation)? If the first is true, you have a problem; if the second is true, you might have a nice result.
poster on wall: well... this is not quite what I was looking for, but if it helps.