I have RNA-seq data (a count table). I would like to draw a heatmap of the samples and cluster them based on RPKM values. Which kind of RPKM output should I use as input for the heatmap?
data = count table (output of HTSeq)
1) Normalized data only, as below:
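# Assumes columns 1-3 of data hold gene annotation (including gene_length) and the remaining columns hold counts
library(edgeR)
data_list <- DGEList(counts = data[, -(1:3)], genes = data[, 1:3])
data_norm <- calcNormFactors(data_list)
RPKM <- rpkm(data_norm, gene.length = data_norm$genes$gene_length)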
2) Normalized data after removing lowly expressed genes, as below:
# Same assumption: columns 1-3 of data are annotation, the rest are counts
data_list <- DGEList(counts = data[, -(1:3)], genes = data[, 1:3])
# Keep genes with CPM > 0.5 in at least 2 samples
data_filter <- rowSums(cpm(data_list) > 0.5) >= 2
data_keep <- data_list[data_filter, , keep.lib.sizes = FALSE]
data_keep_norm <- calcNormFactors(data_keep)
RPKM <- rpkm(data_keep_norm, gene.length = data_keep_norm$genes$gene_length)
Also, I would like to know which scaling is most suitable for a heatmap:
3) log2(RPKM + 0.1)
4) Z-score(RPKM) using the zFPKM package
5) Row Z-score (using the option scale = 'row' in heatmap.2)
There is no standard, so use whichever you prefer. My preference would be the zFPKM output, with any additional scaling switched off in the heatmap function so that the values are not scaled twice.
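To illustrate what switching off additional scaling means, here is a minimal sketch in base R. It uses a made-up RPKM matrix (`rpkm_mat` and its values are hypothetical, not the OP's data), computes the log2 transform and row Z-scores by hand, and passes the already-scaled matrix to base `heatmap()` with `scale = "none"`. Output from the zFPKM package could be substituted for `row_z` in the same way; that step is not shown here.

```r
# Toy RPKM matrix (hypothetical values): 6 genes x 4 samples
set.seed(1)
rpkm_mat <- matrix(
  rexp(24, rate = 0.1),
  nrow = 6,
  dimnames = list(paste0("gene", 1:6), paste0("sample", 1:4))
)

# Option 3 from the question: log2 transform with a small pseudocount
log_rpkm <- log2(rpkm_mat + 0.1)

# Option 5 done by hand: row Z-scores (centre and scale each gene across samples)
row_z <- t(scale(t(log_rpkm)))

# The matrix is already scaled, so tell the heatmap function not to scale again
heatmap(row_z, scale = "none")
```

With heatmap.2 the equivalent would be to pass `scale = "none"` as well; leaving `scale = "row"` on would apply a second row-wise Z-score on top of values that are already standardized.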
How does zFPKM compare with TPM and UQ-normalized raw counts in terms of inter-sample comparability on a heatmap?
Also, OP, use ComplexHeatmap instead of heatmap.2 if you have the freedom to do that. It's much easier to add features going forward.
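As a rough sketch of what this could look like, the example below draws a pre-scaled toy matrix with `ComplexHeatmap::Heatmap()` and adds a sample annotation bar as one example of an extra feature. The matrix `mat` and the `group` labels are invented for illustration; `Heatmap()` does not rescale its input, so whatever scaling was applied beforehand is what gets plotted.

```r
library(ComplexHeatmap)

# Toy pre-scaled expression matrix (hypothetical values): 6 genes x 4 samples
set.seed(1)
mat <- matrix(
  rnorm(24),
  nrow = 6,
  dimnames = list(paste0("gene", 1:6), paste0("sample", 1:4))
)

# Hypothetical sample grouping shown as an annotation bar above the columns
sample_annot <- HeatmapAnnotation(group = c("ctrl", "ctrl", "treated", "treated"))

# Cluster both genes and samples; values are plotted as given (no rescaling)
ht <- Heatmap(
  mat,
  name = "z-score",
  cluster_rows = TRUE,
  cluster_columns = TRUE,
  top_annotation = sample_annot
)
draw(ht)
```

Further annotations (batch, sex, and so on) can be added as extra arguments to `HeatmapAnnotation()` without restructuring the call.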
I had several conversations with the main developer of zFPKM fairly recently, and there is a lot of science behind the method, which gives me confidence in using it when only FPKM or RPKM values are available. It calculates Z-scores from R/FPKM based on empirical evidence from this study: https://www.ncbi.nlm.nih.gov/pubmed/24215113
I don't know how it fares against TPM and FPKM-UQ, though.