I am trying to generate some heatmaps for my RNA seq dataset but struggling to work out how to calculate the z score. Can anyone give me any pointers on how to calculate please. I have;
Gene_DE.txt file which includes; baseMean, logFC, lfcSE, stat, pvalue and FDR.
Also have featureCounts .txt which includes length and count data.
Any pointers would be amazing! Thank you
Thank you for your kind reply. I am using DESeq2, I will have a go with this! Thank you!
thank you ATpoint! I followed your post and I did get the scaled Z matrix. However, I failed to generate a heatmap from the Z matrix with code
pheatmap(Z, clustering_distance_rows = "correlation", clustering_distance_cols = "manhattan")
. The error message is that "Error in hclust(d, method = method) : NA/NaN/Inf in foreign function call (arg 10)". It seems in the Z matrix there are many rows with NA/NaN/Inf. I wonder if you have good suggestions to remove the NA/NaN/Inf from the rows of Z matrix? Is it better to remove these rows first from dds and then do the vst transformation and Z sacling?You probably have NaNs in there because some rows had the same value so Z goes NaN. Remove them, e.g. with
complete.cases()
Thank you very much ATpoint! I followed your suggestions and use
complete.cases()
to remove NaN from the Z matrix, and now the pheatmap works! It seems to me that the heatmap generated from the Z matrix contains all the genes, is there a way to generate a heatmap so that only part of the genes (say top 100 genes ranked with the lowest P-value)? I guess I need to first generate a Z matrix with only the preferable genes.Thank you for posting. I was searching for an answer on whether it is appropriate to use transformed counts to calculate Z-Score. I assume that rlog and vst transformed counts can both be used as input for the Z-Score calculation?
Yes, they're good and commonly-used choices as a) they're properly normalized and b) on log2 scale. By the way, my answer uses vst actually if you look closely.
Hi ATpoint, many thanks for your answer on here, it was extremely helpful. I understand this is an old post, but I had a follow up question if that's ok? I have calculated z scores using the code above. However, is it possible to now convert these z scores so they represent the expression level on a case/control basis rather than for each sample?
I understand that if I were to average the z scores for each gene across cases and controls then they would no longer have a mean of 0 and SD of 1 etc. However, if I were to then run the scale function on these new composite scores, would that give me what I am looking for?
Thanks again!