Complex clustering of RNA-seq data
9.7 years ago
Yrinky • 0

Although I'm working with RNA-seq data, my question is really about statistics and machine learning. I do all my work in R.

So, what I have is expression data from wild types (controls) and mutants across four conditions, where each condition is a combination of one of two cell types and a time point. Basically, my expression matrix looks like this:

         time1_loc1_control    time1_loc1_mutant    time1_loc2_control    time1_loc2_mutant
gene1
gene2
..
..

Expression values are initially in counts-per-million, but I have tried using different transformations before clustering attempts.

What I am trying to achieve is to cluster the genes based on the direction of the change between mutant and control (upregulated or downregulated) and absolute values of expression.

So far I have been able to roughly group the genes based solely on the direction of the change, but I also need to retain the information about absolute expression values. Are there any methods that could help me with this? It would be something like a combination of quantitative and categorical data. I tried using daisy() from the cluster package, but it didn't seem to do what I'm trying to achieve.
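To make the goal concrete, here is a minimal sketch of one possible encoding, not a recommended pipeline: for each condition, keep the signed log2 fold change (direction) and the average log expression (level) as two numeric features, so no categorical variable is needed. The toy data, the `time2_*` column names, the pseudocount of 1, and `k = 3` are all assumptions made for illustration.

```r
set.seed(1)

# Toy CPM matrix: 6 genes x 8 samples (4 conditions, each with control and mutant).
# The time2 columns are assumed; the question only shows the time1 columns.
conds <- c("time1_loc1", "time1_loc2", "time2_loc1", "time2_loc2")
sample_names <- as.vector(rbind(paste0(conds, "_control"),
                                paste0(conds, "_mutant")))
cpm <- matrix(rexp(6 * 8, rate = 1 / 50), nrow = 6,
              dimnames = list(paste0("gene", 1:6), sample_names))

# Log-transform with a pseudocount (assumed value) to stabilise low counts.
log_cpm <- log2(cpm + 1)
ctrl <- log_cpm[, paste0(conds, "_control"), drop = FALSE]
mut  <- log_cpm[, paste0(conds, "_mutant"),  drop = FALSE]

# Direction of change (signed) and overall expression level, per condition.
lfc <- mut - ctrl
avg <- (mut + ctrl) / 2
colnames(lfc) <- paste0(conds, "_lfc")
colnames(avg) <- paste0(conds, "_avg")

# Scale so that neither feature type dominates the distance, then cluster.
features <- scale(cbind(lfc, avg))
hc <- hclust(dist(features), method = "average")
clusters <- cutree(hc, k = 3)  # k = 3 is arbitrary here
```

Because the fold change is signed, genes that go up and genes that go down end up far apart in feature space, while the `_avg` columns separate highly and lowly expressed genes within each direction.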

Another idea was to split the dataset by condition (as the conditions are independent) and cluster the genes separately within each one. This would mean each gene would be clustered four times. Is there any way to determine which genes end up clustered together across those runs?
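One way to read "which genes are being clustered together" is a co-clustering matrix: for every pair of genes, count in how many of the four per-condition clusterings they share a cluster. The sketch below uses simulated features, and the choice of hierarchical clustering with `k = 2` is an assumption for illustration only.

```r
set.seed(1)

genes <- paste0("gene", 1:8)
conds <- c("time1_loc1", "time1_loc2", "time2_loc1", "time2_loc2")

# Hypothetical per-condition features: log fold change and average log expression.
per_cond <- lapply(conds, function(cn) {
  m <- cbind(lfc = rnorm(8), avg = rnorm(8, mean = 5))
  rownames(m) <- genes
  m
})
names(per_cond) <- conds

# Cluster the genes separately in each condition; result is genes x conditions.
labels <- sapply(per_cond, function(m) {
  cutree(hclust(dist(scale(m))), k = 2)
})

# For each pair of genes, count the conditions in which they share a cluster.
co <- Reduce(`+`, lapply(seq_len(ncol(labels)), function(j) {
  outer(labels[, j], labels[, j], "==")
}))
dimnames(co) <- list(genes, genes)
co_frac <- co / ncol(labels)  # fraction of conditions with shared membership

# Optional: a consensus grouping based on how often genes co-cluster.
consensus <- cutree(hclust(as.dist(1 - co_frac)), k = 2)
```

Pairs with `co_frac` equal to 1 are grouped together in every condition, so they are the most consistent co-clustered genes. Cluster labels are arbitrary within each run, which is why the comparison is done on pairs rather than on the label values themselves.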

Thank you for any input on this. I realize it's a bit vague, but I am not very experienced with this.

Cheers!

RNA-Seq R clustering • 2.3k views

It seems to me that you are looking for a (semi-)supervised biclustering approach. To perform biclustering in R, you could use the Iterative Signature Algorithm (ISA) via the isa2 package (developed for microarray data), but I am not aware of any approach available in R for adding a priori knowledge. However, some approaches have been described in the literature.

Thank you for the input, I will look into it!

9.7 years ago

You could try representing your data as a three-way array (e.g. genes x cell types x time points) and doing a PARAFAC/CANDECOMP tensor factorization. This is implemented in R in the PTAk package.
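As a rough illustration of the reshaping step only, the sketch below folds a genes x conditions matrix into a genes x cell types x time points array in base R. The values are simulated log2 fold changes, and the `time2` level and column ordering are assumptions. The factorization itself is not shown; check the PTAk documentation for the exact call.

```r
set.seed(1)

genes <- paste0("gene", 1:5)
locs  <- c("loc1", "loc2")    # cell types
times <- c("time1", "time2")  # time2 is assumed, not shown in the question

# Hypothetical genes x conditions matrix of mutant-vs-control log2 fold changes.
# Columns are ordered with cell type varying fastest within each time point.
cond_names <- paste(rep(times, each = 2), rep(locs, times = 2), sep = "_")
lfc <- matrix(rnorm(5 * 4), nrow = 5, dimnames = list(genes, cond_names))

# Fold into a three-way array. R fills arrays column-major, so the column
# order above maps directly onto the (cell type, time point) dimensions.
X <- array(lfc, dim = c(length(genes), length(locs), length(times)),
           dimnames = list(gene = genes, loc = locs, time = times))

# Each slice X[, , "time1"] is a genes x cell types matrix for one time point.
```

Using fold changes as the array entries keeps the array three-way; keeping control and mutant as separate values would instead give a four-way array with genotype as an extra mode.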
