Question

How to make a PCA plot to check similar samples between two cohorts?

0

Entering edit mode

6.5 years ago

Vasu ▴ 790

I have 20 samples with gene intensities from Microarrays and same 20 samples with counts from RNA-Seq. I see some sample ids were mixed and want to make a PCA plot to check similar samples b/w Microarray and RNAseq cohorts.

I have the z-score data for Microarray and also z-score data for RNAseq. Not sure how to make a plot out of this data to check similar samples.

Can anyone tell me how to make a plot out of those data

Thank you

RNA-Seq microarray pca z-score • 3.8k views

ADD COMMENT • link updated 6.5 years ago by Kevin Blighe 88k • written 6.5 years ago by Vasu ▴ 790

score 0 · Answer 1 · 2018-05-26

0

Entering edit mode

6.5 years ago

Kevin Blighe 88k

Using base R functions, you can generate a PCA bi- and tri-plot. See here: A: PCA plot from read count matrix from RNA-Seq

Using Z-scores is indeed most likely your best chance in terms of 'merging' these datasets and having them on the same distribution. However, in the case of RNA-seq, it's important to understand how you initially normalised the data. What I would do is normalise via EdgeR or DESeq2 and then transform the normalised counts via the rlog function of DESeq2, followed by a further transformation to the Z-scale. If you've got FPKM counts, then there is already a R function / package, called zFPKM, which claims to be able to transform FPKM to the Z scale too.

Edit (May 27, 2018): whilst it may be an interesting exercise to merge these types of data together, the statistical implications are not sound. The data are measured and processed differently and suffer different types of biases. We can get them onto th same data distribution (log2, Z-scale, etc), but the differences between the individual units on that distribution may not be comparable.

Kevin

ADD COMMENT • link 6.5 years ago by Kevin Blighe 88k

0

Entering edit mode

Hi Kevin,

In terms of RNAseq data, using edgeR I applied TMM normalization and with top highly variable genes I transformed the data into Z-scores. I did the same for Microarray data (for microarray data after background correction I used "normalize" function and then calculated z-score.

So, now do I need to merge both the data? Both datasets have same sample names right?

ADD REPLY • link 6.5 years ago by Vasu ▴ 790

0

Entering edit mode

Hey, well, microarray data analysis is quite standardised these days and your [microarray] data prior to Z-scale transformation should have been log2 expression values. When you say 'normalize' function, from which package is that?

From what I understand, EdgeR and DESeq2 normalise data to a Poisson and negative binomial distribution, respectively, the normalised counts of which may not be the best for direct transformation to the Z-scale. With your EdgeR normalised counts, you should first convert them to log2 CPM counts, as indicated in 'Section 2.15 Clustering, heatmaps etc' of the EdgeR vignette: edgeR: differential expression analysis of digital gene expression data.

If you do this (mentioned above), then you will be (in both cases) transforming log2 values to the Z-scale. You will still be criticised in some form if you ever try to publish this, because people criticise everything these days.

ADD REPLY • link 6.5 years ago by Kevin Blighe 88k

0

Entering edit mode

normalize function is from oligo package.

So, for RNAseq - With the counts, I will do normalization and convert them to logCPM with this "logcpm <- cpm(y, prior.count=2, log=TRUE)" and from logCPM to Z-score.

For Microarray - gene intensities, I'm using "oligo" package - will do background correction then apply "normalize" on it and then apply log2() on it and then transform that to Z-score.

Please correct me if I'am wrong. And do I need to merge both dataset and then use it for PCA?

Lets say the merged datasets is "matrix"

project.pca <- prcomp(t(matrix))
summary(project.pca)

#Determine the proportion of variance of each component
#Proportion of variance equals (PC stdev^2) / (sum all PCs stdev^2)
project.pca.proportionvariances <- ((project.pca$sdev^2) / (sum(project.pca$sdev^2)))*100

par(mar=c(4,4,4,4), mfrow=c(1,3), cex=1.0, cex.main=0.8, cex.axis=0.8)

#Plots scatter plot for PC 1 and 2
plot(project.pca$x, type="n", main="Principal components analysis bi-plot", xlab=paste("PC1, ", round(project.pca.proportionvariances[1], 2), "%"), ylab=paste("PC2, ", round(project.pca.proportionvariances[2], 2), "%"))
points(project.pca$x, col="black", pch=16, cex=1)

Do you think this code will be right?

ADD REPLY • link 6.5 years ago by Vasu ▴ 790

1

Entering edit mode

For Microarray - gene intensities, I'm using "oligo" package - will do background correction then apply "normalize" on it and then apply log2() on it and then transform that to Z-score.

No, for this, the rma() function from Oligo already produces log2 data.

So:

RNA-seq: EdgeR normalisation -> 'logcpm <- cpm(y, prior.count=2, log=TRUE)' -> Z-score transformation
Microarray: normalise with rma() -> Z-score transformation

For microarray, you can extract the log2 values with the exprs() function. For example:

microarray <- rma(rawdata)
log2Microarray <- exprs(microarray)

Hope that this helps. I am not sure how your final PCA will appear.

ADD REPLY • link 6.5 years ago by Kevin Blighe 88k

0

Entering edit mode

If in doubt about data distributions, etc., get into the habit of checking the distribution via the hist() function. I can be very useful.

ADD REPLY • link 6.5 years ago by Kevin Blighe 88k

0

Entering edit mode

Thank you. Sorry, I have a matrix with genes as rows and samples as columns with gene intensities.

I used this but have an error.

microarray <- rma(rawdata)
log2Microarray <- exprs(microarray)

Error in (function (classes, fdef, mtable)  : 
  unable to find an inherited method for function 'rma' for signature '"matrix"'

ADD REPLY • link 6.5 years ago by Vasu ▴ 790

0

Entering edit mode

Apologies, my example was more like pseudo-code just for narration. The rma() function of oligo accepts a list of CEL files, as follows:

library("oligo")

CELFiles <- list.celfiles("SampleFiles/", full.names = TRUE)
project <- read.celfiles(CELFiles)

#Background correct, normalize, and calculate gene expression
project.bgcorrect.norm.avg <- rma(project, target="probeset")

I assume that you are working from the raw intensity CEL files...?

ADD REPLY • link 6.5 years ago by Kevin Blighe 88k

0

Entering edit mode

Hi Kevin,

No, I don't have CEL files. I have the microarray data in a matrix. Genenames as rows and samples as columns. rma() is not working with that.

And one more question - For clustering heatmap do I need to follow same steps like following?

RNA-seq: EdgeR normalisation -> 'logcpm <- cpm(y, prior.count=2, log=TRUE)' -> Z-score transformation
Microarray: normalise with rma() -> Z-score transformation

ADD REPLY • link 6.5 years ago by Vasu ▴ 790

0

Entering edit mode

Your microarray data is therefore most likely already on the log2 scale, i.e., if you have downloaded it from the Gene Expression Omnibus (GEO). Again, a quick check of the distribution should reveal this, e.g., boxplots and histograms.

For heatmaps, it would also be beneficial to be plotting the Z-scores, but ensure that any additional scaling in the heatmap function is switched off.

ADD REPLY • link 6.5 years ago by Kevin Blighe 88k

0

Entering edit mode

I have also added a sort of disclaimer to my original answer, which you may want to take into account.

ADD REPLY • link 6.5 years ago by Kevin Blighe 88k

0

Entering edit mode

No, I'm not using any GEO data. It is my own data. As rma() is not working for the matrix, I used "normalize" function.

For heatmap, I'm using complexheatmap, there is nor scaling argument in "Heatmap" function

ADD REPLY • link 6.5 years ago by Vasu ▴ 790