Question

Home made clustering

14 months ago

Bastien Hervé 6.4k

I have a matrix of genes x timepoints (genes in rows, timepoints in columns) and I would like to cluster the genes in a specific manner. First, I want to group the genes by the timepoint at which they have their highest value, starting with the timepoint that has the most genes in its group. Then, within each group, I would like to group the genes again by the timepoint of their second-highest value (again starting with the timepoint that has the most genes in its group), and likewise for the third-highest value. The final ordering within each sub-group should be done on the value at that third timepoint.

I am working in R but I need help on the reasoning, so any language would fit :)

matrix_test <- matrix(c(
2.71232777,0.2991653,1.48852093,3.14893272,
3.67958385,0.8056344,4.26589876,2.00046755,
0.78076051,3.7899685,4.23125160,0.27269827,
3.41225770,1.6129989,1.88738310,3.21987395,
4.40389745,3.0922402,4.33817922,1.09906521,
1.83340066,2.9696409,2.68646770,0.44518642,
0.16507207,0.2406242,2.88768542,0.65375604,
2.59890027,1.9690600,0.07825978,4.39694255,
0.56357199,1.5579906,4.55631245,3.26020126,
4.51192639,2.1634021,3.19912875,0.14074029,
1.09985834,1.7974550,2.77024466,0.65531534,
3.11738251,3.7346642,3.40029681,0.78797571,
1.81569859,4.7622811,4.69980912,2.36738064,
3.94299927,1.0950355,0.91142844,3.75348940,
4.76819637,4.1336852,2.22382943,4.20141960,
3.43268648,0.1373782,1.49140684,4.37702780,
0.05495832,4.7384228,2.25986076,0.05534378,
4.78893397,4.8913683,4.55228674,0.20784590,
2.49051654,4.0678324,3.87029058,1.05679457,
2.70396726,3.3414759,3.40310146,3.63091426),nrow = 20,ncol = 4)

colnames(matrix_test) <- c("Ctrl", "Early", "Peak", "Late")
rownames(matrix_test) <- paste0('gene', seq_len(nrow(matrix_test)))

The result for my example should be:

gene20,gene18,gene11,gene16,gene17,gene3,gene13,gene14,gene1,gene4,gene6,gene10,gene9,gene8,gene7,gene5,gene19,gene2,gene12,gene15

If I break it down:

#PEAK
    #LATE
        #CTRL
        gene20  1.0990652   0.14074029  4.2014196   3.63091426
        gene18  3.0922402   2.16340210  4.1336852   3.34147590
        gene11  4.2312516   0.07825978  4.6998091   4.55228674
    #EARLY
        #CTRL
        gene16  3.2198740   3.26020126  3.7534894   1.05679457
        gene17  4.4038974   4.51192639  4.7681964   2.70396726
        #LATE
        gene3   1.4885209   2.68646770  2.7702447   1.49140684
    #CTRL
    gene13  3.4122577   0.56357199  3.9429993   2.49051654
#LATE
    #CTRL
        #EARLY
        gene14  1.6129989   1.55799060  1.0950355   4.06783240
        gene1   2.7123278   1.83340066  1.0998583   3.43268648
        #PEAK
        gene4   3.1489327   0.44518642  0.6553153   4.37702780
    #PEAK
        #CTRL
        gene6   0.8056344   0.24062420  3.7346642   4.73842280
        gene10  3.7899685   1.96906000  4.7622811   4.89136830
    #EARLY
    gene9   0.7807605   2.59890027  1.8156986   4.78893397

#CTRL
    #PEAK  
        #EARLY
        gene8   2.0004676   0.65375604  0.7879757   0.05534378
        gene7   4.2658988   2.88768542  3.4002968   2.25986076
        #LATE
        gene5   3.6795839   0.16507207  3.1173825   0.05495832
    #LATE
    gene19  4.3381792   3.19912875  2.2238294   3.40310146

#EARLY
    #PEAK
        #CTRL
        gene2   0.2991653   2.96964090  1.7974550   0.13737820
        gene12  0.2726983   4.39694255  2.3673806   0.20784590
    #LATE
    gene15  1.8873831   4.55631245  0.9114284   3.87029058

R clustering • 1.3k views

Could you explain the logic with an example, with gene20 and gene14?

ADD REPLY • link 14 months ago by zx8754 12k

Yes, sure. The idea is to group genes by their expression pattern across timepoints. I am not really interested in the intensity of the expression.

Across the whole gene list, 7 genes have their highest expression at Peak (including gene20), 6 at Late (including gene14), 4 at Ctrl and 3 at Early (note that the timepoints are ordered by decreasing number of genes in their group).
Then for each subset:
- For Peak (7 genes): 3 genes have their second-highest expression at Late (including gene20), 3 at Early and 1 at Ctrl
- For Late (6 genes): 3 genes have their second-highest expression at Ctrl (including gene14), 2 at Peak and 1 at Early

Then for each sub-subset:
- For Peak -> Late (3 genes): all 3 genes have their third-highest expression at Ctrl (including gene20), so there is a single group, which I sort by the Ctrl value.
- For Late -> Ctrl (3 genes): 2 genes have their third-highest expression at Early (including gene14) and 1 at Peak. I sort the Early group by the Early value.
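
The first two grouping steps described above can be sketched in R as follows. The `toy` matrix is made-up data for illustration only (it is not the matrix from the question), and all variable names are assumptions.

```r
# Hypothetical toy data: 5 genes x 4 timepoints (not the question's matrix)
toy <- matrix(c(1, 4, 3, 2,
                5, 1, 2, 3,
                2, 3, 9, 1,
                1, 2, 8, 7,
                0, 1, 6, 2),
              nrow = 5, byrow = TRUE,
              dimnames = list(paste0("g", 1:5),
                              c("Ctrl", "Early", "Peak", "Late")))

# Step 1: timepoint holding the highest value of each gene
top <- colnames(toy)[apply(toy, 1, which.max)]
# Number of genes per timepoint, largest group first
sizes <- sort(table(top), decreasing = TRUE)

# Step 2: inside the largest group, drop its timepoint and repeat
biggest <- names(sizes)[1]
sub <- toy[top == biggest, colnames(toy) != biggest, drop = FALSE]
second <- colnames(sub)[apply(sub, 1, which.max)]
split(rownames(sub), second)
```

Dropping the column that defined a group before calling `which.max` again is what turns "highest" into "second-highest" at the next level.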

ADD REPLY • link 14 months ago by Bastien Hervé 6.4k

I was trying to use a recursive function going down the logical tree, but once I reach a leaf I have trouble going back to the last node with the updated tree and gene list. I also tried a dirty way with nested for loops, but I am having a hard time recording the gene order in a nested list tree.

ADD REPLY • link 14 months ago by Bastien Hervé 6.4k

score 2 · Accepted Answer · 2024-06-07

Using a recursive function:

d <- data.frame(matrix_test)

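f <- function(x){
  # Stop when only one timepoint or one gene is left: return the gene names
  if(ncol(x) == 1 || nrow(x) == 1) {
    return(rownames(x))
  } else {
    # Group genes by the timepoint (column) holding their highest remaining value
    xs <- split(x, apply(x, 1, \(i) colnames(x)[ which.max(i) ]))
    # Recurse into each group after dropping the timepoint that defined it
    sapply(names(xs), function(i){ f( xs[[ i ]][, colnames(x) != i, drop = FALSE ]) },
           simplify = FALSE)
  }
}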

f(d)

#$Ctrl
#$Ctrl$Late
#[1] "gene19"
#
#$Ctrl$Peak
#$Ctrl$Peak$Early
#[1] "gene5" "gene7" "gene8"
#
#
#
#$Early
#$Early$Late
#[1] "gene15"
#
#$Early$Peak
#$Early$Peak$Ctrl
#[1] "gene2"  "gene12"
#
#
#
#$Late
#$Late$Ctrl
#$Late$Ctrl$Early
#[1] "gene1"  "gene14"
#
#$Late$Ctrl$Peak
#[1] "gene4"
#
#
#$Late$Early
#[1] "gene9"
#
#$Late$Peak
#$Late$Peak$Ctrl
#[1] "gene6"  "gene10"
#
#
#
#$Peak
#$Peak$Ctrl
#[1] "gene13"
#
#$Peak$Early
#$Peak$Early$Ctrl
#[1] "gene16" "gene17"
#
#$Peak$Early$Late
#[1] "gene3"
#
#
#$Peak$Late
#$Peak$Late$Ctrl
#[1] "gene11" "gene18" "gene20"
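
The nested list above lists groups alphabetically and does not sort the genes inside a leaf. Below is a minimal sketch of one way to get a flat gene order with the largest group first at each level and each leaf sorted on its third timepoint. `order_genes` and its `depth` argument are hypothetical names, not part of the answer. Two points are assumptions because the question does not settle them: groups of equal size keep the alphabetical order produced by `split()`, and leaves are sorted in ascending order.

```r
# Example data copied from the question (filled column-wise)
matrix_test <- matrix(c(
2.71232777,0.2991653,1.48852093,3.14893272,
3.67958385,0.8056344,4.26589876,2.00046755,
0.78076051,3.7899685,4.23125160,0.27269827,
3.41225770,1.6129989,1.88738310,3.21987395,
4.40389745,3.0922402,4.33817922,1.09906521,
1.83340066,2.9696409,2.68646770,0.44518642,
0.16507207,0.2406242,2.88768542,0.65375604,
2.59890027,1.9690600,0.07825978,4.39694255,
0.56357199,1.5579906,4.55631245,3.26020126,
4.51192639,2.1634021,3.19912875,0.14074029,
1.09985834,1.7974550,2.77024466,0.65531534,
3.11738251,3.7346642,3.40029681,0.78797571,
1.81569859,4.7622811,4.69980912,2.36738064,
3.94299927,1.0950355,0.91142844,3.75348940,
4.76819637,4.1336852,2.22382943,4.20141960,
3.43268648,0.1373782,1.49140684,4.37702780,
0.05495832,4.7384228,2.25986076,0.05534378,
4.78893397,4.8913683,4.55228674,0.20784590,
2.49051654,4.0678324,3.87029058,1.05679457,
2.70396726,3.3414759,3.40310146,3.63091426),nrow = 20,ncol = 4)
colnames(matrix_test) <- c("Ctrl", "Early", "Peak", "Late")
rownames(matrix_test) <- paste0("gene", seq_len(nrow(matrix_test)))

# order_genes() is a hypothetical helper; depth = number of grouping levels
order_genes <- function(x, depth = 3) {
  # Timepoint holding each gene's highest remaining value
  top <- colnames(x)[apply(x, 1, which.max)]
  groups <- split(rownames(x), top)
  # Largest group first; equal sizes keep split()'s alphabetical order (assumption)
  groups <- groups[order(-lengths(groups))]
  out <- lapply(names(groups), function(tp) {
    genes <- groups[[tp]]
    if (depth == 1 || ncol(x) == 1) {
      # Last level: sort the leaf on its defining timepoint, ascending (assumption)
      genes[order(x[genes, tp])]
    } else {
      # Drop the timepoint just used and group on the remaining columns
      order_genes(x[genes, colnames(x) != tp, drop = FALSE], depth - 1)
    }
  })
  # Each call returns a flat vector, so nothing has to be tracked on the way back up
  unlist(out, use.names = FALSE)
}

gene_order <- order_genes(matrix_test)
matrix_test[gene_order, ]
```

Because each recursive call returns its own ordered vector, there is no shared tree to update when coming back from a leaf. Since the tie-breaking rule and sort direction are assumptions, the result need not match the hand-built list in the question exactly.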