Hello everyone, I have some questions about how to properly design a hierarchical clustering of time-course RNA-seq data. The data consist of seven time points with two replicates per time point. I have already performed data QC, DESeq2 count normalization, and DE analysis (based on the LRT). I use R as my main programming language.
My aim is to predict putative transcription factors that may drive the upregulation of co-regulated genes. The output I would like is simply a cluster assignment for each gene; I would then visualize the trends and run predictions on the genes with interesting patterns using tools like iRegulon.
Thus, my idea is:
1. Subsetting the genes of interest based on GO terms and filtering for the ones that show significant DE.
2. Creating the data matrix from the normalized counts processed by the vst function (with blind=TRUE).
3. Then applying the following:
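library(amap)  # assumption: Dist() here is the one provided by the amap package
# Pairwise distances between genes (rows of vsd_matrix)
dist_matrix <- Dist(vsd_matrix, method = "pearson")
# Complete-linkage hierarchical clustering of the genes
hc_genes_o <- hclust(dist_matrix, method = "complete")
hc_genes_d <- as.dendrogram(hc_genes_o)
plot(hc_genes_d, cex = 0.6)
# Cut the tree at a chosen height to get one cluster label per gene
clust_pearson_genes <- cutree(hc_genes_o, h= ...)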
Then I create a data frame that records the cluster of each gene, convert it to a tidy (long) format, and plot it with ggplot2 (using geom_smooth) to get a quick idea of the trends.
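To make that last step concrete, here is a minimal sketch of what I have in mind. Everything in it is a stand-in: the matrix is simulated instead of being my real vsd_matrix, the distance is computed with base R's cor() rather than Dist(), the number of clusters (k = 4) is arbitrary, and the object names (z, tidy_df, p) are only illustrative.

```r
library(ggplot2)

set.seed(1)

# Simulated stand-in for the VST matrix: 20 genes x 14 samples
# (7 time points x 2 replicates); not real data.
time_points <- rep(1:7, each = 2)
replicate   <- rep(1:2, times = 7)
vsd_matrix  <- matrix(
  rnorm(20 * 14), nrow = 20,
  dimnames = list(paste0("gene", 1:20),
                  paste0("t", time_points, "_r", replicate))
)

# Cluster genes on 1 - Pearson correlation (base R used here instead of Dist()).
hc       <- hclust(as.dist(1 - cor(t(vsd_matrix))), method = "complete")
clusters <- cutree(hc, k = 4)  # k = 4 is an arbitrary choice for the sketch

# Row-wise z-scores so genes with different expression levels share one scale.
z <- t(scale(t(vsd_matrix)))

# Tidy (long) table: one row per gene x sample.
# as.vector() unrolls the matrix column by column, hence the rep() patterns.
tidy_df <- data.frame(
  gene      = rep(rownames(z), times = ncol(z)),
  sample    = rep(colnames(z), each = nrow(z)),
  time      = rep(time_points, each = nrow(z)),
  replicate = rep(replicate, each = nrow(z)),
  z         = as.vector(z)
)
tidy_df$cluster <- factor(clusters[as.character(tidy_df$gene)])

# One panel per cluster; both replicates are kept as separate points
# and geom_smooth() draws the overall trend through them.
p <- ggplot(tidy_df, aes(x = time, y = z)) +
  geom_point(alpha = 0.3) +
  geom_smooth(method = "loess", formula = y ~ x, se = FALSE) +
  facet_wrap(~ cluster) +
  labs(x = "Time point", y = "Row z-score (VST)")
```

Calling print(p) draws the panels. In this sketch the two replicates are simply treated as independent observations at each time point, which is exactly the part I am unsure about.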
My main questions are:
- How should I handle the replicates?
- Are there better methods for computing distances (taking into consideration that trends may not be linear)?
- Are there better strategies (packages / statistical approaches) for dissecting the trends?
- Are there other tools I could use instead of iRegulon?
The first one is my biggest problem! Thank you in advance for your attention.
Have a nice weekend, Daniele