Hi!
I am using the ConsensusClusterPlus package in R to find the optimal number of gene expression clusters. I follow the steps below (note: the code is pseudocode, only to aid understanding):
1 Get the RNA-seq data (rows: genes, cols: samples/patients)
2 Keep only the top 30% most variable genes by MAD:
# MAD per gene (assumes `data` has gene names as row names)
row_mads <- apply(data, MARGIN = 1, mad)
row_mads <- sort(row_mads, decreasing = TRUE)
top_percentage <- 0.3
num_rows_to_keep <- ceiling(top_percentage * length(row_mads))
row_mads <- row_mads[seq_len(num_rows_to_keep)]
data <- data[names(row_mads), ]
3 Normalize expression per gene (subtract each gene's median): data <- sweep(data, 1, apply(data, 1, median, na.rm = TRUE))
4 Run ConsensusClusterPlus:
ConsensusClusterPlus(data.matrix(data),
                     maxK = 6,
                     reps = 50,
                     pItem = 0.8,
                     pFeature = 1,
                     title = title,
                     clusterAlg = "hc",
                     distance = "pearson",
                     seed = 1262118388.71279,
                     plot = "png")
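For reference, the filtering and centering steps (2 and 3) can be sketched as a self-contained example in base R. This is only an illustration on a simulated matrix: the names `expr`, `gene_mad`, `top_genes`, `expr_top` and `expr_centred` are made up here, and the random values merely stand in for real expression data.

```r
# Toy stand-in for an expression matrix (rows: genes, cols: samples).
# The values are random and only illustrate the mechanics.
set.seed(1)
expr <- matrix(rnorm(200 * 10), nrow = 200, ncol = 10,
               dimnames = list(paste0("gene", 1:200),
                               paste0("sample", 1:10)))

# Step 2: keep the top 30% most variable genes by MAD.
gene_mad <- apply(expr, 1, mad)
n_keep <- ceiling(0.3 * length(gene_mad))
top_genes <- names(sort(gene_mad, decreasing = TRUE))[seq_len(n_keep)]
expr_top <- expr[top_genes, , drop = FALSE]

# Step 3: centre each gene on its own median
# (sweep() subtracts by default, FUN = "-" is spelled out for clarity).
gene_median <- apply(expr_top, 1, median, na.rm = TRUE)
expr_centred <- sweep(expr_top, 1, gene_median, FUN = "-")

dim(expr_centred)
```

The resulting `expr_centred` plays the role of `data` in step 4.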
My question is about step 1. I have RNA-seq counts normalized with the limma-voom pipeline (similar to DESeq2), which normalizes the data across samples so that samples can be compared. Should I pass ConsensusClusterPlus the raw RNA-seq counts or the normalized counts?
Best Regards and Thank you,
Manuel