I have a genes x timepoints matrix (genes in rows, timepoints in columns) and I would like to order the genes in a specific, nested way. First, I want to group the genes by the timepoint at which each has its highest value, starting with the timepoint that has the most genes in its group. Then, within each group, I want to group the genes again by the timepoint of their second-highest value (again starting with the timepoint that has the most genes), and likewise for the third-highest value. Finally, within each sub-group, the genes should be sorted by their value at that third-ranked timepoint.
I am working in R, but I mainly need help with the reasoning, so any language would work :)
matrix_test <- matrix(c(
2.71232777,0.2991653,1.48852093,3.14893272,
3.67958385,0.8056344,4.26589876,2.00046755,
0.78076051,3.7899685,4.23125160,0.27269827,
3.41225770,1.6129989,1.88738310,3.21987395,
4.40389745,3.0922402,4.33817922,1.09906521,
1.83340066,2.9696409,2.68646770,0.44518642,
0.16507207,0.2406242,2.88768542,0.65375604,
2.59890027,1.9690600,0.07825978,4.39694255,
0.56357199,1.5579906,4.55631245,3.26020126,
4.51192639,2.1634021,3.19912875,0.14074029,
1.09985834,1.7974550,2.77024466,0.65531534,
3.11738251,3.7346642,3.40029681,0.78797571,
1.81569859,4.7622811,4.69980912,2.36738064,
3.94299927,1.0950355,0.91142844,3.75348940,
4.76819637,4.1336852,2.22382943,4.20141960,
3.43268648,0.1373782,1.49140684,4.37702780,
0.05495832,4.7384228,2.25986076,0.05534378,
4.78893397,4.8913683,4.55228674,0.20784590,
2.49051654,4.0678324,3.87029058,1.05679457,
2.70396726,3.3414759,3.40310146,3.63091426),nrow = 20,ncol = 4)
colnames(matrix_test) <- c("Ctrl", "Early", "Peak", "Late")
rownames(matrix_test) <- paste0("gene", seq_len(nrow(matrix_test)))
The result for my example should be:
gene20,gene18,gene11,gene16,gene17,gene3,gene13,gene14,gene1,gene4,gene6,gene10,gene9,gene8,gene7,gene5,gene19,gene2,gene12,gene15
Broken down (value columns are Ctrl, Early, Peak, Late; the # lines mark the nested timepoint groups):
#PEAK
  #LATE
    #CTRL
      gene20 1.0990652 0.14074029 4.2014196 3.63091426
      gene18 3.0922402 2.16340210 4.1336852 3.34147590
      gene11 4.2312516 0.07825978 4.6998091 4.55228674
  #EARLY
    #CTRL
      gene16 3.2198740 3.26020126 3.7534894 1.05679457
      gene17 4.4038974 4.51192639 4.7681964 2.70396726
    #LATE
      gene3 1.4885209 2.68646770 2.7702447 1.49140684
  #CTRL
    gene13 3.4122577 0.56357199 3.9429993 2.49051654
#LATE
  #CTRL
    #EARLY
      gene14 1.6129989 1.55799060 1.0950355 4.06783240
      gene1 2.7123278 1.83340066 1.0998583 3.43268648
    #PEAK
      gene4 3.1489327 0.44518642 0.6553153 4.37702780
  #PEAK
    #CTRL
      gene6 0.8056344 0.24062420 3.7346642 4.73842280
      gene10 3.7899685 1.96906000 4.7622811 4.89136830
  #EARLY
    gene9 0.7807605 2.59890027 1.8156986 4.78893397
#CTRL
  #PEAK
    #EARLY
      gene8 2.0004676 0.65375604 0.7879757 0.05534378
      gene7 4.2658988 2.88768542 3.4002968 2.25986076
    #LATE
      gene5 3.6795839 0.16507207 3.1173825 0.05495832
  #LATE
    gene19 4.3381792 3.19912875 2.2238294 3.40310146
#EARLY
  #PEAK
    #CTRL
      gene2 0.2991653 2.96964090 1.7974550 0.13737820
      gene12 0.2726983 4.39694255 2.3673806 0.20784590
  #LATE
    gene15 1.8873831 4.55631245 0.9114284 3.87029058
Could you explain the logic with an example, with gene20 and gene14?
Yes, sure. The idea is to group genes by their expression pattern across timepoints; I am not really interested in the intensity of the expression.
Of the 20 genes, 7 have their highest expression at Peak (including gene20), 6 at Late (including gene14), 4 at Ctrl, and 3 at Early. The timepoints are therefore visited in that order, from the largest group to the smallest.
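The first grouping step described here can be sketched in a few lines of R. This is only an illustration on a hypothetical toy matrix (`toy`, with made-up gene names `g1` to `g5` and timepoints `A`, `B`, `C`), assuming no missing values and taking the first column when a gene has two equal maxima.

```r
# Hypothetical toy matrix: genes in rows, timepoints in columns.
toy <- matrix(c(9, 2, 1,
                8, 1, 3,
                1, 7, 2,
                6, 5, 4,
                2, 3, 8),
              ncol = 3, byrow = TRUE,
              dimnames = list(paste0("g", 1:5), c("A", "B", "C")))

# Timepoint at which each gene reaches its maximum
# (which.max() returns the first maximum if there are exact ties).
top_tp <- colnames(toy)[apply(toy, 1, which.max)]
names(top_tp) <- rownames(toy)

# Group sizes, largest first: the order in which the groups are visited.
group_sizes <- sort(table(top_tp), decreasing = TRUE)
print(group_sizes)
```

With `matrix_test` in place of `toy`, the same two steps should give the counts quoted above: Peak 7, Late 6, Ctrl 4, Early 3.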
Then for each subset :
I tried writing a recursive function that walks down the tree of groups, but once I reach a leaf I have trouble getting back to the parent node with the updated tree and gene list. I also tried a cruder approach with nested for loops, but I am having a hard time recording the gene order in a nested list tree.
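One way around the "going back up the tree" problem is to let each recursive call return the ordered gene names for its own subset, so the caller only has to concatenate the results and no shared tree needs updating. The sketch below is a minimal version of that idea, not a definitive implementation. The function name `nested_order` and its arguments `levels` and `tp_order` are hypothetical. It assumes no missing values, an ascending sort inside each leaf, and that ties in group size are broken by the visiting order of the level above (column order at the top level); that tie rule is an assumption, since the question does not specify one.

```r
# Hypothetical helper: returns gene names in the nested group order.
# m        : numeric matrix, genes in rows, timepoints in columns (with dimnames)
# levels   : number of nested grouping levels (3 in the question)
# tp_order : timepoints still available, in the order used to break ties
nested_order <- function(m, levels = 3, tp_order = colnames(m)) {
  # Index (within tp_order) of each gene's highest remaining timepoint.
  idx <- apply(m[, tp_order, drop = FALSE], 1, which.max)
  top <- tp_order[idx]
  # Visit timepoints by decreasing group size. order() is stable, so tied
  # groups keep the order inherited from the level above (assumed tie rule).
  counts <- tabulate(idx, nbins = length(tp_order))
  visit <- tp_order[order(-counts)]
  out <- character(0)
  for (tp in visit) {
    sub <- m[top == tp, , drop = FALSE]
    if (nrow(sub) == 0) next
    if (levels == 1 || length(tp_order) == 1) {
      # Leaf: sort ascending on the value at the last grouping timepoint.
      out <- c(out, rownames(sub)[order(sub[, tp])])
    } else {
      # Recurse on this group with the current timepoint removed.
      out <- c(out, nested_order(sub, levels - 1, setdiff(visit, tp)))
    }
  }
  out
}

# Toy check with two grouping levels (hypothetical data).
toy <- matrix(c(9, 2, 1,
                8, 1, 3,
                1, 7, 2,
                6, 5, 4,
                2, 3, 8),
              ncol = 3, byrow = TRUE,
              dimnames = list(paste0("g", 1:5), c("A", "B", "C")))
toy_order <- nested_order(toy, levels = 2)
print(toy_order)  # "g1" "g4" "g2" "g3" "g5"
```

For the data in the question the call would be `nested_order(matrix_test)`. Traced by hand, this reproduces the posted order for the first 13 genes (gene20 through gene9) but differs in two places. It returns gene5, gene8, gene7 for the Ctrl > Peak group, because gene5's Early value (0.165) is higher than its Late value (0.055), so Early is its third-ranked timepoint too. It also puts gene12 before gene2, because the ascending sort on Ctrl gives 0.273 before 0.299. If a different leaf sort or tie rule is intended, only the `order()` calls need to change.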