I have 20 samples with gene intensities from microarrays and the same 20 samples with counts from RNA-seq. I see some sample IDs were mixed up and want to make a PCA plot to check which samples match between the microarray and RNA-seq cohorts.
I have the Z-score data for the microarrays and also Z-score data for RNA-seq. I am not sure how to make a plot out of this data to check for similar samples.
Can anyone tell me how to make a plot out of those data?
Thank you
Hi Kevin,
In terms of RNA-seq data, using edgeR I applied TMM normalization and, with the top highly variable genes, I transformed the data into Z-scores. I did the same for the microarray data (after background correction I used the "normalize" function and then calculated Z-scores).
So, do I now need to merge both datasets? Both datasets have the same sample names, right?
Hey, well, microarray data analysis is quite standardised these days and your [microarray] data prior to Z-scale transformation should have been log2 expression values. When you say 'normalize' function, from which package is that?
From what I understand, edgeR and DESeq2 both model counts with a negative binomial distribution, and the normalised counts may not be the best for direct transformation to the Z-scale. With your edgeR normalised counts, you should first convert them to log2 CPM values, as indicated in 'Section 2.15 Clustering, heatmaps etc' of the edgeR vignette: edgeR: differential expression analysis of digital gene expression data.
If you do this (mentioned above), then you will be (in both cases) transforming log2 values to the Z-scale. You will still be criticised in some form if you ever try to publish this, because people criticise everything these days.
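To illustrate the Z-scale step, here is a minimal sketch in base R. It assumes a log2 expression matrix with genes as rows and samples as columns (for RNA-seq, the logCPM output of cpm(); for microarray, the log2 intensities). The toy data and the helper name row_zscore are made up for illustration and are not from any package.

```r
# Toy log2 expression matrix: genes in rows, samples in columns (made-up values)
set.seed(1)
log_expr <- matrix(rnorm(5 * 4, mean = 8, sd = 2), nrow = 5,
                   dimnames = list(paste0("gene", 1:5), paste0("S", 1:4)))

# Hypothetical helper: Z-score each gene (row) across samples.
# scale() centres and scales columns, so transpose before and after.
row_zscore <- function(x) {
  t(scale(t(x), center = TRUE, scale = TRUE))
}

z <- row_zscore(log_expr)

# Each gene should now have mean ~0 and standard deviation 1 across samples
round(rowMeans(z), 10)
apply(z, 1, sd)
```

Applying the same helper separately to each platform puts every gene on a comparable scale within that platform, which is the point of converting both datasets from log2 values.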
normalize function is from oligo package.
So, for RNAseq - With the counts, I will do normalization and convert them to logCPM with this "logcpm <- cpm(y, prior.count=2, log=TRUE)" and from logCPM to Z-score.
For Microarray - gene intensities, I'm using "oligo" package - will do background correction then apply "normalize" on it and then apply log2() on it and then transform that to Z-score.
Please correct me if I am wrong. And do I need to merge both datasets and then use the result for PCA?
Let's say the merged dataset is "matrix"
Do you think this code will be right?
No, for this, the rma() function from oligo already produces log2 data, so there is no need to apply log2() again.
For microarray, you can extract the log2 values with the exprs() function.
Hope that this helps. I am not sure how your final PCA will appear.
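As a rough sketch of the merge-and-PCA idea in base R: it assumes two Z-scored matrices with the same gene IDs as row names and the same sample IDs as column names. The object names z_array and z_rnaseq, the simulated values, and the deliberately swapped labels are all invented for illustration. The cross-platform correlation at the end is an extra check, not something required for the PCA.

```r
set.seed(42)
genes   <- paste0("gene", 1:50)
samples <- paste0("S", 1:6)

# Simulated per-sample signal shared by both platforms, plus platform noise (toy data)
signal   <- matrix(rnorm(50 * 6), nrow = 50, dimnames = list(genes, samples))
z_array  <- signal + matrix(rnorm(50 * 6, sd = 0.3), nrow = 50)
z_rnaseq <- signal + matrix(rnorm(50 * 6, sd = 0.3), nrow = 50)

# Simulate a labelling mix-up in the RNA-seq data: S2 and S3 are swapped
colnames(z_rnaseq)[c(2, 3)] <- c("S3", "S2")

# Keep only genes present on both platforms and tag samples by platform
common <- intersect(rownames(z_array), rownames(z_rnaseq))
colnames(z_array)  <- paste0(colnames(z_array), "_array")
colnames(z_rnaseq) <- paste0(colnames(z_rnaseq), "_rnaseq")
merged <- cbind(z_array[common, ], z_rnaseq[common, ])

# PCA with samples as observations (prcomp expects them in rows)
pca <- prcomp(t(merged), center = TRUE, scale. = FALSE)
platform <- rep(c("array", "rnaseq"), times = c(ncol(z_array), ncol(z_rnaseq)))
plot(pca$x[, 1:2], col = ifelse(platform == "array", "blue", "red"), pch = 19)
text(pca$x[, 1:2], labels = colnames(merged), pos = 3, cex = 0.7)

# For each array sample, find the RNA-seq sample it correlates with best
cors <- cor(z_array[common, ], z_rnaseq[common, ])
best_match <- colnames(cors)[apply(cors, 1, which.max)]
names(best_match) <- rownames(cors)
best_match
```

On real data, matched samples should sit close together on the PCA plot, and a sample whose best correlation partner carries a different ID is a candidate for a label swap.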
If in doubt about data distributions, etc., get into the habit of checking the distribution via the hist() function. It can be very useful.
Thank you. Sorry, I have a matrix with genes as rows and samples as columns with gene intensities.
I used this but have an error.
Apologies, my example was more like pseudo-code just for narration. The rma() function of oligo works on raw data read in from a list of CEL files.
I assume that you are working from the raw intensity CEL files...?
Hi Kevin,
No, I don't have CEL files. I have the microarray data in a matrix. Genenames as rows and samples as columns. rma() is not working with that.
And one more question: for a clustering heatmap, do I need to follow the same steps, like the following?
Your microarray data is therefore most likely already on the log2 scale, i.e., if you have downloaded it from the Gene Expression Omnibus (GEO). Again, a quick check of the distribution should reveal this, e.g., boxplots and histograms.
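A quick way to run that check in base R is sketched below. The data are simulated, and looks_log2 is a made-up rule-of-thumb helper (log2 microarray values usually stay well below 25, while raw intensities run into the thousands), not a formal test.

```r
set.seed(7)
raw_int  <- matrix(2^rnorm(200, mean = 8, sd = 2), nrow = 50)  # toy raw-scale intensities
log2_int <- log2(raw_int)                                      # same data on the log2 scale

# Hypothetical heuristic: flag a matrix as "probably log2" if its maximum is small
looks_log2 <- function(x, max_expected = 25) {
  max(x, na.rm = TRUE) <= max_expected
}

looks_log2(raw_int)
looks_log2(log2_int)

# Visual checks: overall distribution and per-sample boxplots
hist(log2_int, breaks = 30, main = "Expression distribution", xlab = "value")
boxplot(log2_int, main = "Per-sample distributions", las = 2)
```

If the histogram is strongly right-skewed with a long tail of very large values, the matrix is probably still on the raw intensity scale and needs log2() before Z-scoring.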
For heatmaps, it would also be beneficial to be plotting the Z-scores, but ensure that any additional scaling in the heatmap function is switched off.
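As an example of switching the extra scaling off, here is a small sketch using the base R heatmap() function, which scales rows by default unless scale = "none" is set. The matrix is simulated; other heatmap packages have their own defaults, so check the relevant documentation.

```r
set.seed(3)
log_expr <- matrix(rnorm(40, mean = 8, sd = 2), nrow = 10,
                   dimnames = list(paste0("gene", 1:10), paste0("S", 1:4)))

# Z-score each gene across samples before plotting
z <- t(scale(t(log_expr)))

# scale = "none" stops heatmap() from re-scaling the already Z-scored rows
pal <- colorRampPalette(c("blue", "white", "red"))(50)
hm <- heatmap(z, scale = "none", col = pal)
```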
I have also added a sort of disclaimer to my original answer, which you may want to take into account.
No, I'm not using any GEO data. It is my own data. As rma() is not working for the matrix, I used "normalize" function.
For the heatmap, I'm using ComplexHeatmap; there is no scaling argument in the "Heatmap" function.