Even though I'm working with RNA-seq data, my question is more statistical and machine-learning in nature. I do all my work in R.
What I have is expression data from wild types (controls) and mutants across four conditions, where each condition is a combination of one of two cell types and one of two time points. Basically, my expression matrix looks like this:
       time1_loc1_control  time1_loc1_mutant  time1_loc2_control  time1_loc2_mutant
gene1
gene2
...
Expression values are initially in counts-per-million, but I have tried several transformations before attempting to cluster.
What I am trying to achieve is to cluster the genes based on both the direction of the change between mutant and control (upregulated or downregulated) and the absolute expression level.
So far I have been able to roughly group the genes based solely on the direction of the change, but I also need to retain the information about absolute expression values. Are there any methods that could help me with this? It would be something like a combination of quantitative and categorical data. I tried using daisy(), but it didn't seem to do what I'm trying to achieve.
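For concreteness, the kind of mixed-type setup meant here can be sketched as below. This is only an illustration on made-up data, not my actual analysis: the toy values, the feature names (direction, mean_expr) and the choice of k = 4 are all assumptions.

```r
library(cluster)  # provides daisy() and pam()

set.seed(1)
# Hypothetical toy data: 20 genes, log2-scale expression in one condition
n_genes <- 20
control <- rnorm(n_genes, mean = 5, sd = 2)
mutant  <- control + rnorm(n_genes, mean = 0, sd = 1.5)

# One categorical feature (direction of change) and one numeric feature
# (average expression level); both names are illustrative
features <- data.frame(
  direction = factor(ifelse(mutant > control, "up", "down"),
                     levels = c("down", "up")),
  mean_expr = (control + mutant) / 2,
  row.names = paste0("gene", seq_len(n_genes))
)

# Gower dissimilarity can combine a factor with a numeric column
d <- daisy(features, metric = "gower")

# Partitioning around medoids on the dissimilarity; k = 4 is arbitrary
fit <- pam(d, k = 4, diss = TRUE)
clusters <- fit$clustering

# Cross-tabulate direction against cluster membership
table(features$direction, clusters)
```

With a single two-level factor, the direction term contributes either 0 or 1 to each pairwise dissimilarity, so it tends to dominate the split unless the columns are weighted (daisy() has a weights argument for that).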
Another idea was to split the dataset by condition (as the conditions are independent) and cluster the genes separately within each one. This would mean each gene is clustered four times. Is there any way to determine which genes end up clustered together across those runs?
Thank you for any input on this. I realize it's a bit vague, but I am not very experienced with this.
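To make the per-condition idea concrete, here is a rough sketch on random toy data. It clusters the genes once per condition and then counts, for every pair of genes, in how many conditions they share a cluster. The time2_* column names, the use of hclust() with average linkage and k = 3 are assumptions for illustration only.

```r
set.seed(1)
# Hypothetical toy matrix: 30 genes x 8 columns (4 conditions x control/mutant)
conditions <- c("time1_loc1", "time1_loc2", "time2_loc1", "time2_loc2")
genes <- paste0("gene", 1:30)
col_names <- as.vector(rbind(paste0(conditions, "_control"),
                             paste0(conditions, "_mutant")))
expr <- matrix(rnorm(length(genes) * length(col_names), mean = 5, sd = 2),
               nrow = length(genes),
               dimnames = list(genes, col_names))

k <- 3  # arbitrary number of clusters per condition

# One clustering per condition, using that condition's two columns
labels <- sapply(conditions, function(cond) {
  cols <- paste0(cond, c("_control", "_mutant"))
  hc <- hclust(dist(expr[, cols]), method = "average")
  cutree(hc, k = k)
})

# Co-clustering matrix: number of conditions in which two genes share a cluster
co <- matrix(0, nrow = length(genes), ncol = length(genes),
             dimnames = list(genes, genes))
for (cond in conditions) {
  co <- co + outer(labels[, cond], labels[, cond], "==")
}

# Gene pairs that share a cluster in every condition
always_together <- which(co == length(conditions) & upper.tri(co),
                         arr.ind = TRUE)

# The co-clustering counts can themselves be clustered as a similarity
consensus <- hclust(as.dist(length(conditions) - co), method = "average")
```

Cluster labels are not comparable between conditions (cluster 1 in one run has nothing to do with cluster 1 in another), which is why the sketch compares pairs of genes within each run rather than the labels themselves.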
Cheers!
It seems to me that you are looking for a (semi-)supervised biclustering approach. To perform biclustering in R you could use the Iterative Signature Algorithm (ISA) via the package isa2 (developed for microarray data), but I am not aware of any approach available in R that lets you add a priori knowledge. However, some approaches have been described in the literature.
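A minimal sketch of running ISA with isa2 might look like the following. The toy matrix with one planted block is made up for illustration, the default thresholds are used as-is, and the call is only attempted if isa2 is installed; treat it as a starting point rather than a recommended analysis.

```r
set.seed(1)
# Hypothetical toy matrix: 200 genes x 12 samples with one planted bicluster
expr <- matrix(rnorm(200 * 12), nrow = 200,
               dimnames = list(paste0("gene", 1:200),
                               paste0("sample", 1:12)))
expr[1:30, 1:5] <- expr[1:30, 1:5] + 3  # genes 1-30 are elevated in samples 1-5

modules <- NULL
if (requireNamespace("isa2", quietly = TRUE)) {
  # isa() normalizes the data, runs ISA from random seeds and merges the results
  modules <- isa2::isa(expr)

  # Rows of modules$rows are genes and columns are modules;
  # a non-zero score means the gene belongs to that module
  n_modules <- ncol(modules$rows)
  if (n_modules > 0) {
    genes_in_first   <- rownames(expr)[modules$rows[, 1] != 0]
    samples_in_first <- colnames(expr)[modules$columns[, 1] != 0]
  }
}
```

Note that this is fully unsupervised: nothing in the sketch uses the mutant/control labels, so any prior knowledge would have to be introduced separately.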
Thank you for the input, I will look into it!