Question

Does ConsensusClusterPlus take normalized gene counts or raw gene counts?

23 months ago

Manuel Sokolov Ravasqueira ▴ 110

Hi!

I am using the ConsensusClusterPlus package in R to find the optimal number of gene expression clusters. I follow these steps (note: the code is pseudocode, just to aid understanding):

1 Get the RNA-seq data (rows: genes, columns: samples/patients)

2 Keep only the top 30% most variable genes by median absolute deviation (MAD):

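# MAD per gene (rows are genes)
row_mads <- apply(data, MARGIN = 1, FUN = mad)
top_percentage <- 0.3
num_rows_to_keep <- ceiling(top_percentage * length(row_mads))
# Row indices of the most variable genes, highest MAD first;
# indexing by position also works when the matrix has no row names
keep <- order(row_mads, decreasing = TRUE)[seq_len(num_rows_to_keep)]
data <- data[keep, , drop = FALSE]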

3 Median-center expression per gene: data <- sweep(data, 1, apply(data, 1, median, na.rm = TRUE))

4 Apply method:

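library(ConsensusClusterPlus)

title <- tempdir()  # assumption: output folder for the plots, as in the package tutorial
results <- ConsensusClusterPlus(data.matrix(data),
  maxK = 6,            # evaluate cluster counts k = 2..6
  reps = 50,           # number of resampling iterations
  pItem = 0.8,         # proportion of samples drawn per iteration
  pFeature = 1,        # proportion of genes drawn per iteration
  title = title,
  clusterAlg = "hc",   # hierarchical clustering
  distance = "pearson",
  seed = 1262118388.71279,
  plot = "png")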
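
Steps 2 and 3 can be checked on a small matrix before running the clustering. The following is a minimal sketch in base R on made-up values (the `toy` matrix and its gene and sample names are hypothetical, not the real data): it keeps the top 30% of genes by MAD and then confirms that every retained gene has a median of 0 after centering.

```r
# Toy matrix (made-up values): 10 genes x 6 samples.
# Gene r is simulated with standard deviation r / 5, so later genes vary more.
set.seed(1)
toy <- matrix(rnorm(60, mean = 8, sd = rep(1:10, times = 6) / 5),
              nrow = 10,
              dimnames = list(paste0("gene", 1:10), paste0("s", 1:6)))

# Step 2: keep the top 30% most variable genes by MAD
gene_mad <- apply(toy, 1, mad)
n_keep <- ceiling(0.3 * nrow(toy))
keep <- order(gene_mad, decreasing = TRUE)[seq_len(n_keep)]
filtered <- toy[keep, , drop = FALSE]

# Step 3: subtract each gene's median (the default FUN of sweep() is "-")
centered <- sweep(filtered, 1, apply(filtered, 1, median, na.rm = TRUE))

# After centering, the per-gene medians should all be (numerically) zero
print(round(apply(centered, 1, median), 10))
```
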

My question concerns step 1. I have RNA-seq counts normalized with the limma-voom pipeline (similar to DESeq2), which normalizes the data across samples so that samples can be compared. Should I pass the raw RNA-seq counts or the normalized counts to ConsensusClusterPlus?

Best Regards and Thank you,
Manuel

R Gene-Expression ConsensusCluster DGE • 1.2k views

updated 23 months ago by bk11 ★ 3.1k • written 23 months ago by Manuel Sokolov Ravasqueira ▴ 110

score 2 · Answer 1 · 2023-09-08

You should pass normalized counts. The ConsensusClusterPlus tutorial uses the ALL dataset from the 'ALL' package, which contains microarray data from 128 individuals with acute lymphoblastic leukemia. These data have been normalized (using RMA). Please see page 2 of the file linked here:

https://bioconductor.org/packages/release/data/experiment/manuals/ALL/man/ALL.pdf
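
To illustrate why normalized values are the safer input, here is a minimal base R sketch on simulated counts (all values and names such as `raw`, `depth`, and `log_cpm` are made up for illustration). It computes log2 counts-per-million with the same offsets the voom transformation uses (0.5 added to counts, 1 added to library sizes), but without TMM scaling factors, so it is only an approximation of a real limma-voom run. With limma-voom itself, the corresponding matrix would be the `E` component of the object returned by `voom()`.

```r
# Made-up raw counts: 200 genes x 6 samples with different sequencing depths
set.seed(42)
depth <- c(1, 1, 2, 2, 4, 4)                      # hypothetical relative library sizes
base_mean <- rgamma(200, shape = 2, rate = 0.02)  # per-gene expression level
raw <- sapply(depth, function(d) rpois(200, lambda = base_mean * d))
dimnames(raw) <- list(paste0("gene", 1:200), paste0("s", 1:6))

# log2 counts-per-million; t(raw) puts samples in rows so that each
# sample is divided by its own library size
lib_size <- colSums(raw)
log_cpm <- t(log2((t(raw) + 0.5) / (lib_size + 1) * 1e6))

# Raw counts: per-sample medians track sequencing depth.
# log-CPM: per-sample medians are comparable across samples.
print(round(apply(log2(raw + 0.5), 2, median), 2))
print(round(apply(log_cpm, 2, median), 2))
```

In this sketch, `log_cpm` (not `raw`) is what would go through the MAD filter and median-centering steps before being passed to ConsensusClusterPlus; otherwise differences in sequencing depth between samples could influence the clustering.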