I am clustering RNA-seq data into groups of similar expression patterns and visualizing the results. To this end, I have been:
1) Log-transforming and normalizing the RNA-seq data with the following two functions (the first is from the edgeR package, the second is from the EDASeq package):
cpm.data.new <- cpm(data, TRUE, TRUE)
betweenLaneNormalization(cpm.data.new, which="full", round=FALSE)
2) Standardizing each gene to have a mean=0 and standard deviation=1.
3) Performing hierarchical clustering (from the stats package) using ward.D linkage
hclust(d, method="ward.D")
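For readers who want to try steps 2 and 3 end to end, here is a minimal sketch in base R. It uses a simulated matrix in place of the real normalized log-CPM data, and the names log.expr, z, hc, and clusters, as well as the choice of Euclidean distance and k = 4, are assumptions made for illustration only, not what was used in the original analysis.

```r
# Toy log-expression matrix: 20 genes (rows) x 6 samples (columns).
# In practice this would be the normalized, log-transformed matrix from step 1.
set.seed(1)
log.expr <- matrix(rnorm(20 * 6, mean = 5, sd = 2), nrow = 20,
                   dimnames = list(paste0("gene", 1:20),
                                   paste0("sample", 1:6)))

# Step 2: standardize each gene (row) to mean 0 and standard deviation 1.
# scale() operates on columns, so transpose, scale, and transpose back.
z <- t(scale(t(log.expr)))

# Step 3: pairwise distances between genes (Euclidean is an assumption here),
# followed by hierarchical clustering with ward.D linkage.
d <- dist(z, method = "euclidean")
hc <- hclust(d, method = "ward.D")

# Cut the tree into k groups; k = 4 is an arbitrary illustrative choice.
clusters <- cutree(hc, k = 4)
table(clusters)
```

One thing that may be worth checking: the hclust help page notes that "ward.D" does not implement Ward's (1963) criterion on unsquared distances, whereas "ward.D2" squares the dissimilarities internally, so the two options can give different trees from the same d.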
The resulting clusters look pretty clean when plotted. However, I am trying to determine whether this is a recommended approach to hierarchical clustering of gene expression data (normalizing, log-transforming, and standardizing in this manner). It is unclear to me whether the literature recommends a particular method for preparing RNA-seq data for hierarchical clustering.
Thank you for sharing any advice or information you may have.
Hi Kevin. I'm new to this. Sorry if this question is a bit basic. When you refer to a "normal distribution", do you mean "all genes' expression in one sample" or "one gene's expression in all samples"?
Hello, by 'normal distribution', I mean this:
[source: https://www.mathsisfun.com/data/standard-normal-distribution.html]
Log-transformed and/or Z-scaled RNA-seq data should approximately follow this distribution, although this is not always the case.
-------------------------------------
RNA-seq raw and normalised data, however, follow the negative binomial distribution.
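As a rough illustration of the difference, the sketch below simulates negative binomial counts for one hypothetical gene across many samples and then log-transforms and Z-scales them. The values mu = 100 and size = 10, the pseudocount of 1, and the choice to look at one gene across samples are all assumptions made for the example, not estimates from real data.

```r
set.seed(42)

# Simulated counts for one hypothetical gene across 1000 samples.
# mu (mean) and size (inverse dispersion) are made-up values.
counts <- rnbinom(1000, mu = 100, size = 10)

# For a negative binomial, variance = mu + mu^2 / size, so the variance
# exceeds the mean (overdispersion relative to a Poisson).
mean(counts)
var(counts)

# Log-transform with a pseudocount of 1, then Z-scale to mean 0 and sd 1.
z <- as.numeric(scale(log2(counts + 1)))

# Plot both side by side to inspect how close each is to a bell curve.
op <- par(mfrow = c(1, 2))
hist(counts, breaks = 40, main = "Simulated raw counts")
hist(z, breaks = 40, main = "log2(counts + 1), Z-scaled")
par(op)
```

How close the transformed values come to a normal shape depends on the mean and dispersion, which fits the "but not always" caveat above.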
Thank you! I'm looking over your answers about PCA. They are very helpful, thanks again!