Hi everyone!
I have run some statistical tests that resulted in the data below. Now I just want to cluster column 2 based on the p-values shown in column 3. I tried using mcl:
mcl input_Data.txt --abc -I 1.5 -o mclOutput
The result was that Y1, Y2, and Y3 each form their own individual clusters, while Y4-Y11 formed one cluster together. I was expecting that Y1-Y3 would at least form a cluster together since their values are quite similar. Perhaps mcl is not the optimal clustering method for this kind of data? Any suggestions?
X1 Y1 9.98E-196
X1 Y2 6.88E-193
X1 Y3 3.32E-184
X1 Y4 0.000254
X1 Y5 0.00032
X1 Y6 0.000765
X1 Y7 0.00117
X1 Y8 0.00278
X1 Y9 0.0148
X1 Y10 0.0175
X1 Y11 0.0474
Thank you.
Well, the distance between Y1 and Y2 is larger than the whole spread of Y4-Y11 (in orders of magnitude), and the distance between Y2 and Y3 is even larger, so the result makes sense. There isn't much point in clustering on only one variable; you can instead divide them into groups arbitrarily.
Thanks, but I will be producing hundreds of these kinds of tables, so I would prefer to automate the clustering/dividing into groups task.
Optimal granularity of a clustering is often in the eyes of the beholder. Anyway, I am curious as to what the purpose is. Since the magnitude of a p-value says nothing about what's been measured, p-values are not of much use for anything except to try and avoid false positives. Plot the ranked -log(p-values) and look at the shape of the curve, most likely you'll have a few extreme values and a very long low tail.
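To make the "ranked -log(p-values)" suggestion concrete, here is a minimal Python sketch that ranks the values from the table in the question. The dictionary and the function name `ranked_neg_log10` are only illustrative, and base 10 is an assumption (the reply just says "log"); plotting the resulting scores is left to whatever tool you prefer.

```python
import math

# p-values copied from the table in the question (X1 vs. each Y)
pvalues = {
    "Y1": 9.98e-196, "Y2": 6.88e-193, "Y3": 3.32e-184,
    "Y4": 0.000254, "Y5": 0.00032, "Y6": 0.000765,
    "Y7": 0.00117, "Y8": 0.00278, "Y9": 0.0148,
    "Y10": 0.0175, "Y11": 0.0474,
}

def ranked_neg_log10(pvals):
    """Return (label, -log10(p)) pairs, most significant first."""
    scores = {label: -math.log10(p) for label, p in pvals.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

ranked = ranked_neg_log10(pvalues)
for label, score in ranked:
    # One line per Y: label, then its -log10(p) score
    print(f"{label}\t{score:.2f}")
```

Printed in this order, the few extreme values and the long low tail should be visible even without a plot.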
Thank you, basically the purpose is to assign the most representative Y for X1. I do not want to go with the smallest p-value because there are others such as Y2 and Y3 which are also extremely small and significant, so I am trying to group the Y's in some meaningful automated way. For example something that outputs: the most representative Y for X1 is Y1, Y2, and Y3.
The plot is here: https://imgur.com/u3RpJ4M Are you suggesting using a cutoff based on the density plot?
How do you define most representative? Typically, this could be the median. On the face of it, p-values are the wrong thing to use because they do not have a direct relation to the values that were tested and a low p-value indicates an extreme outcome (under the null hypothesis of the test). Of course I can be wrong here because I don't know the details of your data. In the linked density plot, the distribution is clearly bimodal so this could represent two clusters but my suggestion was to simply look at the values plotted in decreasing order.
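If the goal is to automate "look at the values in decreasing order" over hundreds of tables, one simple heuristic is to cut the ranked list at the largest gap between consecutive -log10(p) values. This is not a method proposed in this thread, just a sketch of one possible rule; the function name `top_group_by_largest_gap` is hypothetical, and it assumes every p-value is strictly greater than zero.

```python
import math

def top_group_by_largest_gap(pvals):
    """Return the labels that sit above the largest gap in ranked -log10(p).

    pvals: dict mapping label -> p-value (all p-values must be > 0).
    """
    ranked = sorted(pvals, key=lambda label: pvals[label])  # smallest p first
    if len(ranked) < 2:
        return ranked
    scores = [-math.log10(pvals[label]) for label in ranked]
    # Gap between each score and the next, lower-ranked one
    gaps = [scores[i] - scores[i + 1] for i in range(len(scores) - 1)]
    cut = max(range(len(gaps)), key=gaps.__getitem__)  # index of the largest gap
    return ranked[: cut + 1]

# p-values copied from the table in the question
pvalues = {
    "Y1": 9.98e-196, "Y2": 6.88e-193, "Y3": 3.32e-184,
    "Y4": 0.000254, "Y5": 0.00032, "Y6": 0.000765,
    "Y7": 0.00117, "Y8": 0.00278, "Y9": 0.0148,
    "Y10": 0.0175, "Y11": 0.0474,
}

top = top_group_by_largest_gap(pvalues)
print(top)  # ['Y1', 'Y2', 'Y3']
```

Note that this rule always returns some split, even when the values are evenly spread, so a minimum gap size would probably be needed before trusting it on every table.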
I don't think you will be able to do this by clustering. If I understand your setup correctly, none of the Y variables have relationships to each other. That is, X1 is the only "hub" that indirectly connects Y variables. If so, even though extremely low p-values are what you want, they end up nullifying those edges between X1 and the Ys, because mcl reads the third column as an edge weight (a similarity) and a weight near zero means an almost absent edge.
You may want to convert the third column into -log10(p-values). That way the more statistically significant Ys will at least be assigned stronger weights. Next, play with the inflation factor (-I) in a 0.5-8 range and see if that makes any difference. I suspect that all Y variables that have -log10(p-values) > 0 will end up in the same cluster, which probably is not what you want.
Thanks Mensur Dlakic! I tried converting to -log10 and re-running; now all Y's are in the same cluster (regardless of the -I inflation value), as you suspected.
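For reference, the -log10 conversion of the third column discussed above could be scripted roughly as follows. This is only a sketch: the function name `abc_to_neg_log10` is made up for illustration, it assumes whitespace-separated three-column lines as in the question, and it rejects p-values of zero because their logarithm is undefined.

```python
import math

def abc_to_neg_log10(lines):
    """Convert 'source target p-value' lines to 'source target -log10(p)' lines."""
    converted = []
    for line in lines:
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        source, target, p = fields[0], fields[1], float(fields[2])
        if p <= 0:
            raise ValueError(f"cannot take -log10 of p-value {p!r}")
        converted.append(f"{source}\t{target}\t{-math.log10(p):.4f}")
    return converted

# Two rows from the table in the question, as a quick check
sample = ["X1 Y1 9.98E-196", "X1 Y4 0.000254"]
converted = abc_to_neg_log10(sample)
print("\n".join(converted))
```

To process a whole file, pass an open file handle as `lines` and write the returned lines to a new file for the mcl --abc run.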