Question

How to work out Z score for heatmaps for RNA seq dataset

1

Entering edit mode

3.9 years ago

v.johnson ▴ 30

I am trying to generate some heatmaps for my RNA seq dataset but struggling to work out how to calculate the z score. Can anyone give me any pointers on how to calculate please. I have;

Gene_DE.txt file which includes; baseMean, logFC, lfcSE, stat, pvalue and FDR.

Also have featureCounts .txt which includes length and count data.

Any pointers would be amazing! Thank you

R • 18k views

ADD COMMENT • link updated 10 months ago by robert.flynn.21 ▴ 10 • written 3.9 years ago by v.johnson ▴ 30

1

Entering edit mode

3.9 years ago

Rashedul Islam ▴ 480

You can use 'scale' option of pheatmap package in R to use Z-score in heatmap.

ADD COMMENT • link 3.9 years ago by Rashedul Islam ▴ 480

score 12 · Accepted Answer · 2021-05-21

12

Entering edit mode

3.9 years ago

ATpoint 87k

Usually people plot the normalzed counts on the log2 scale, for the differential genes, and these converted to Z scale:

You seem to use DESeq2, so one could directly use the output from vst transformation:

vsd <- assay(vst(dds))
Z <- t(scale(t(vsd)))

So what does that do? We apply scale which is the actual Z-scoring function. In R data.frames and matrices are basically lists of vectors (every column is a vector) but we want rowwise (per gene) Z-scores. scale by default (like any operation on matrices / dfs) operates column-wise though, so we first transpose the matrix, then scale it, and then transpose it back so we again have a column=sample and row=gene matrix (or data.frame). That's it.

Then subset Z to the genes you want to plot and make the heatmap.

Does that make sense to you?

ADD COMMENT • link 3.9 years ago by ATpoint 87k

0

Entering edit mode

Thank you for your kind reply. I am using DESeq2, I will have a go with this! Thank you!

ADD REPLY • link 3.9 years ago by v.johnson ▴ 30

0

Entering edit mode

thank you ATpoint! I followed your post and I did get the scaled Z matrix. However, I failed to generate a heatmap from the Z matrix with code pheatmap(Z, clustering_distance_rows = "correlation", clustering_distance_cols = "manhattan"). The error message is that "Error in hclust(d, method = method) : NA/NaN/Inf in foreign function call (arg 10)". It seems in the Z matrix there are many rows with NA/NaN/Inf. I wonder if you have good suggestions to remove the NA/NaN/Inf from the rows of Z matrix? Is it better to remove these rows first from dds and then do the vst transformation and Z sacling?

ADD REPLY • link 3.5 years ago by xiaoleiusc ▴ 140

1

Entering edit mode

You probably have NaNs in there because some rows had the same value so Z goes NaN. Remove them, e.g. with complete.cases()

ADD REPLY • link 3.5 years ago by ATpoint 87k

0

Entering edit mode

Thank you very much ATpoint! I followed your suggestions and use complete.cases() to remove NaN from the Z matrix, and now the pheatmap works! It seems to me that the heatmap generated from the Z matrix contains all the genes, is there a way to generate a heatmap so that only part of the genes (say top 100 genes ranked with the lowest P-value)? I guess I need to first generate a Z matrix with only the preferable genes.

ADD REPLY • link 3.5 years ago by xiaoleiusc ▴ 140

0

Entering edit mode

Thank you for posting. I was searching for an answer on whether it is appropriate to use transformed counts to calculate Z-Score. I assume that rlog and vst transformed counts can both be used as input for the Z-Score calculation?

ADD REPLY • link 2.3 years ago by vanbelj ▴ 40

1

Entering edit mode

Yes, they're good and commonly-used choices as a) they're properly normalized and b) on log2 scale. By the way, my answer uses vst actually if you look closely.

ADD REPLY • link 2.3 years ago by ATpoint 87k

0

Entering edit mode

Hi ATpoint, many thanks for your answer on here, it was extremely helpful. I understand this is an old post, but I had a follow up question if that's ok? I have calculated z scores using the code above. However, is it possible to now convert these z scores so they represent the expression level on a case/control basis rather than for each sample?

I understand that if I were to average the z scores for each gene across cases and controls then they would no longer have a mean of 0 and SD of 1 etc. However, if I were to then run the scale function on these new composite scores, would that give me what I am looking for?

Thanks again!

ADD REPLY • link 10 months ago by robert.flynn.21 ▴ 10