Perform PCA / single value decomposition

Question

PCA plot from read count matrix from RNA-Seq

15

Entering edit mode

7.1 years ago

rachel.kubik12 ▴ 220

Hello!

I have RNA-Seq data with 32 samples. I used Salmon to generate a read count matrix. I want to generate a PCA plot to look at the relationship between my samples. How do I use the read count matrix to perform the PCA analysis?

Thanks!

RNA-Seq pca • 57k views

ADD COMMENT • link updated 3.1 years ago by BioNovice247 ▴ 20 • written 7.1 years ago by rachel.kubik12 ▴ 220

0

Entering edit mode

Hi Kevin, Thanks! Can you please tell me how can I color two groups, one in green and other in red?

ADD REPLY • link 5.5 years ago by mayankkaashyap ▴ 30

0

Entering edit mode

Hey, this is now a Bioconductor package. So, you should be able to colour the groups easily by using that: https://github.com/kevinblighe/PCAtools ?

ADD REPLY • link 5.5 years ago by Kevin Blighe 88k

0

Entering edit mode

Hi Kevin, Thanks for your reply. I can't install PCAtools in R: it download yet when I open the package it says not found. I have 46 colums with Leaf and fruit samples and rows are genes. I don't know how to plot PCA and color Leaf and fruit. Is there a simple example I can understand with. Sorry I am beginner and trying to learn. Thanks

ADD REPLY • link 5.5 years ago by mayankkaashyap ▴ 30

0

Entering edit mode

Can you show the exact commands that you used to try to run PCAtools, and the exact output error messages?

If you also paste a sample of your data and metadata, then I can walk you through it.

ADD REPLY • link 5.5 years ago by Kevin Blighe 88k

0

Entering edit mode

Hi Kevin, Thanks so much for you kind help! I installed and types following:

trying URL 'https://cran.rstudio.com/bin/macosx/el-capitan/contrib/3.5/xtable_1.8-4.tgz' Content type 'application/x-gzip' length 699599 bytes (683 KB) ================================================== downloaded 683 KB

The downloaded binary packages are in /var/folders/32/g4_njl8j7f7867s941883mgcc9wj_p/T//Rtmp5XNdUM/downloaded_packages Warning message: package ‘PCAtools’ is not available (for R version 3.5.1) devtools::install_github('kevinblighe/PCAtools') Error in loadNamespace(name) : there is no package called ‘devtools’ library(PCAtools) Error in library(PCAtools) : there is no package called ‘PCAtools’ library(PCAtools) Error in library(PCAtools) : there is no package called ‘PCAtools’

My data files is as follows: Where: Column 1 is tissue types and Row1 is genes. I have 46 tissue type and 1000 genes. I want to draw a PCA plot showing all leaf samples in green and all fruit samples in red!!

   Solyc00g005000 Solyc00g005040 Solyc00g005050 Solyc00g005060 Solyc00g005080 Solyc00g005090
  
Leaf1 0.00000000 Leaf2 0.00000000 Leaf3 0.00000000 Leaf4 0.00000000 Leaf5 0.00000000 Leaf6 0.00000000 Fruit1 0.00000000 Fruit2 0.00000000 Fruit3 0.00000000 Fruit4 0.00000000 Fruit5 0.00000000 Fruit6 0.00000000 Fruit7 0.00000000 Fruit8 0.00000000 Fruit9 0.00000000 Fruit10 0.00000000 Fruit11 0.00000000 Fruit12 0.00000000 Fruit13 0.00000000 Fruit14 0.05333333 Fruit15 0.02666667 Fruit16 0.06000000 Thanks ! 2.97000000 18.58000 0 0.00000000 0 0.00000000 20.83000 0 0.00000000 0 0.25000000 12.20000 0 0.00000000 0 0.00000000 7.62000 0 0.00000000 0 0.00000000 18.00000 0 0.00000000 0 1.33000000 21.58000 0 0.00000000 0 0.00000000 13.12667 0 0.00000000 0 0.24666667 8.64000 0 0.00000000 0 0.06666667 10.58000 0 0.00000000 0 0.70666667 37.30000 0 0.00000000 0 0.02666667 45.96667 0 0.00000000 0 0.08333333 24.34333 0 0.00000000 0 0.28000000 20.16333 0 0.00000000 0 0.34333333 27.76667 0 0.00000000 0 0.15333333 15.42667 0 0.03666667 0 0.10666667 12.86667 0 0.00000000 0 0.25666667 16.36000 0 0.03666667 0 1.03333333 40.36000 0 0.04333333 0 0.58000000 57.46000 0 0.00000000 0 0.16333333 29.14333 0 0.00000000 0 0.06000000 26.37667 0 0.00000000 0 0.18000000 49.53000 0 0.00000000 0

ADD REPLY • link 5.5 years ago by mayankkaashyap ▴ 30

0

Entering edit mode

Solyc00g005000 Solyc00g005040 Solyc00g005050 Solyc00g005060 Solyc00g005080 Solyc00g005090

Leaf1 0.00000000 2.97000000 18.58000 0 0.00000000 0

Leaf2 0.00000000 0.00000000 20.83000 0 0.00000000 0

Leaf3 0.00000000 0.25000000 12.20000 0 0.00000000 0

Leaf4 0.00000000 0.00000000 7.62000 0 0.00000000 0

Leaf5 0.00000000 0.00000000 18.00000 0 0.00000000 0

Leaf6 0.00000000 1.33000000 21.58000 0 0.00000000 0

Fruit1 0.00000000 0.00000000 13.12667 0 0.00000000 0

Fruit2 0.00000000 0.24666667 8.64000 0 0.00000000 0

Fruit3 0.00000000 0.06666667 10.58000 0 0.00000000 0

Fruit4 0.00000000 0.70666667 37.30000 0 0.00000000 0

Fruit5 0.00000000 0.02666667 45.96667 0 0.00000000 0

Fruit6 0.00000000 0.08333333 24.34333 0 0.00000000 0

Fruit7 0.00000000 0.28000000 20.16333 0 0.00000000 0

Fruit8 0.00000000 0.34333333 27.76667 0 0.00000000 0

Fruit9 0.00000000 0.15333333 15.42667 0 0.03666667 0

Fruit10 0.00000000 0.10666667 12.86667 0 0.00000000 0

Fruit11 0.00000000 0.25666667 16.36000 0 0.03666667 0

Fruit12 0.00000000 1.03333333 40.36000 0 0.04333333 0

Fruit13 0.00000000 0.58000000 57.46000 0 0.00000000 0

Fruit14 0.05333333 0.16333333 29.14333 0 0.00000000 0

Fruit15 0.02666667 0.06000000 26.37667 0 0.00000000 0

Fruit16 0.06000000 0.18000000 49.53000 0 0.00000000 0

ADD REPLY • link 5.5 years ago by mayankkaashyap ▴ 30

1

Entering edit mode

Ah, I see, you will require R 3.6 for PCAtools - sorry. For now, you can still use base R functions. Here is an example with the data that you have given:

View data

head(data)
      Solyc00g005000 Solyc00g005040 Solyc00g005050 Solyc00g005060
Leaf1              0           2.97          18.58              0
Leaf2              0           0.00          20.83              0
Leaf3              0           0.25          12.20              0
Leaf4              0           0.00           7.62              0
Leaf5              0           0.00          18.00              0
Leaf6              0           1.33          21.58              0
      Solyc00g005080 Solyc00g005090
Leaf1              0              0
Leaf2              0              0
Leaf3              0              0
Leaf4              0              0
Leaf5              0              0
Leaf6              0              0

Create colour vector and shape vector for leaf / fruit

type <- gsub("[0-9]*$", "", rownames(data))
type
 [1] "Leaf"  "Leaf"  "Leaf"  "Leaf"  "Leaf"  "Leaf"  "Fruit" "Fruit" "Fruit"
[10] "Fruit" "Fruit" "Fruit" "Fruit" "Fruit" "Fruit" "Fruit" "Fruit" "Fruit"
[19] "Fruit" "Fruit" "Fruit" "Fruit"

type <- factor(type, levels = c("Leaf", "Fruit"))
type
 [1] Leaf  Leaf  Leaf  Leaf  Leaf  Leaf  Fruit Fruit Fruit Fruit Fruit Fruit
[13] Fruit Fruit Fruit Fruit Fruit Fruit Fruit Fruit Fruit Fruit
Levels: Leaf Fruit

colType <- c("forestgreen", "red2")[type]
colType
 [1] "forestgreen" "forestgreen" "forestgreen" "forestgreen" "forestgreen"
 [6] "forestgreen" "red2"        "red2"        "red2"        "red2"       
[11] "red2"        "red2"        "red2"        "red2"        "red2"       
[16] "red2"        "red2"        "red2"        "red2"        "red2"       
[21] "red2"        "red2"

pchType <- c(18, 16)[type]
pchType
 [1] 18 18 18 18 18 18 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16

Perform PCA and plot PC1 vs. PC2

pca <- prcomp(data)
plot(
  pca$x,
  col = colType,
  pch = pchType,
  cex = 3.0)
legend(
  "bottomright",
  bty = "n",
  c("Leaf", "Fruit"),
  fill = c("forestgreen", "red2"),
  cex = 2.0)

ADD REPLY • link 5.5 years ago by Kevin Blighe 88k

3

Entering edit mode

Hi Kevin, Thanks so much! Don't have word to thank you!

ADD REPLY • link 5.5 years ago by mayankkaashyap ▴ 30

Ram · Answer 1 · 2017-11-09

65

Entering edit mode

7.1 years ago

Kevin Blighe 88k

NB - this is now a Bioconductor R package: https://github.com/kevinblighe/PCAtools

-------------------------

You should normalise your data prior to performing PCA. In the code below, you'll have to add plot legends yourself, and also colour vectors (passed to the 'col' parameter).

Then, assuming that you have transcripts as rows and samples as columns:

NB - in this code, the plots I've shown don't necessarily match the exact code, but the plot type is the same

[Edit: also take a look at my definition of PCA here: PCA in a RNA seq analysis]

Perform PCA / single value decomposition

project.pca <- prcomp(t(MyReadCountMatrix))
summary(project.pca)

#Determine the proportion of variance of each component
#Proportion of variance equals (PC stdev^2) / (sum all PCs stdev^2)
project.pca.proportionvariances <- ((project.pca$sdev^2) / (sum(project.pca$sdev^2)))*100

Scree plot

barplot(project.pca.proportionvariances, cex.names=1, xlab=paste("Principal component (PC), 1-", length(project.pca$sdev)), ylab="Proportion of variation (%)", main="Scree plot", ylim=c(0,100))

scree

Pairs plots

par(cex=1.0, cex.axis=0.8, cex.main=0.8)
pairs(project.pca$x[,1:5], col="black", main="Principal components analysis bi-plot\nPCs 1-5", pch=16)
pairs(project.pca$x[,6:10], col="black", main="Principal components analysis bi-plot\nPCs 6-10", pch=16)

pairs

Bi-plots

par(mar=c(4,4,4,4), mfrow=c(1,3), cex=1.0, cex.main=0.8, cex.axis=0.8)

#Plots scatter plot for PC 1 and 2
plot(project.pca$x, type="n", main="Principal components analysis bi-plot", xlab=paste("PC1, ", round(project.pca.proportionvariances[1], 2), "%"), ylab=paste("PC2, ", round(project.pca.proportionvariances[2], 2), "%"))
points(project.pca$x, col="black", pch=16, cex=1)

#Plots scatter plot for PC 1 and 3
plot(project.pca$x[,1], project.pca$x[,3], type="n", main="Principal components analysis bi-plot", xlab=paste("PC1, ", round(project.pca.proportionvariances[1], 2), "%"), ylab=paste("PC3, ", round(project.pca.proportionvariances[3], 2), "%"))
points(project.pca$x[,1], project.pca$x[,3], col="black", pch=16, cex=1)

#Plots scatter plot for PC 2 and 3
plot(project.pca$x[,2], project.pca$x[,3], type="n", main="Principal components analysis bi-plot", xlab=paste("PC2, ", round(project.pca.proportionvariances[2], 2), "%"), ylab=paste("PC3, ", round(project.pca.proportionvariances[3], 2), "%"))
points(project.pca$x[,2], project.pca$x[,3], col="black", pch=16, cex=1)

Tri-plot

require(scatterplot3d)
par(mar=c(4,4,4,4), cex=1.0, cex.main=0.8, cex.axis=0.8)

scatterplot3d(project.pca$x[,1:3], angle=-40, main="", color="black", pch=17, xlab=paste("PC1, ", round(project.pca.proportionvariances[1], 2), "%"), ylab=paste("PC2, ", round(project.pca.proportionvariances[2], 2), "%"), zlab=paste("PC3, ", round(project.pca.proportionvariances[3], 2), "%"), grid=FALSE, box=FALSE)
source('http://www.sthda.com/sthda/RDoc/functions/addgrids3d.r')
addgrids3d(project.pca$x[,1:3], grid = c("xy", "xz", "yz"))

source('http://www.sthda.com/sthda/RDoc/functions/addgrids3d.r')
addgrids3d(project.pca$x[,1:3], grid = c("xy", "xz", "yz"))

ADD COMMENT • link 5.8 years ago by Kevin Blighe 88k

1

Entering edit mode

@Kevin Blighe:

Thanks for your contribution to Biostars community, this thread could be a good tutorial for PCA analysis series.

ADD REPLY • link 5.5 years ago by Jurat Shahidin ▴ 100

0

Entering edit mode

Hi Kevin,

I also want to perform a PCA plot. I have a read-count matrix data. You mentioned that we need to normalize data prior to PCA plot. Could you please help me how to normalize the data?

ADD REPLY • link 7.1 years ago by Mehmet ▴ 820

0

Entering edit mode

Hi Mehmet, from where did you obtain the counts or how did you produce them?

ADD REPLY • link 7.1 years ago by Kevin Blighe 88k

0

Entering edit mode

Hi Kevin,

I guess we obtain reads counts from HtSeq, and performed DEG analyses using edgeR. I have 23 samples. Previously, I generated a PCA plot using read.count.matrix data following your codes above. I also would like to ask you; what is difference between generating a PCA plot from PC1 to PC5, and PC1 to PC10?

ADD REPLY • link 7.1 years ago by Mehmet ▴ 820

1

Entering edit mode

Hi Mehmet, I have not used EdgeR (I prefer DESeq2), but PCA should be performed on normally-distributed counts, so, your normalised logged counts.

There is no difference between the generation of the pairs plots for PC1-5 and PC6-10. It is just a matter of fitting all of the plots on the same page. Technically, you can generate a pairs plot for PC1-PC100, but you would require a very large plotting window.

ADD REPLY • link 7.1 years ago by Kevin Blighe 88k

0

Entering edit mode

Hi Kevin, Thank you very much.

Should I do anything in my FPKM data prior to PCA plot?

I have run analyses to generate PCA plot (pairs plots). I would like to ask you a few things: 1. What are the dots in boxes? 2.What are the numbers around each boxes?

I have 17 samples in my data, but I have 15 dots in boxes in the plot.

could you please help me about how to interpret results ?

Thank you

ADD REPLY • link 7.1 years ago by Mehmet ▴ 820

0

Entering edit mode

If you have used HTSeq followed by EdgeR, then you should have logged counts(?). How does the distribution appear when you run: hist(MyData) ?

Can you post an image of the pairs plot and provide the code that you are using to execute the PCA?

You can upload figures/images here, and then copy/psate the URL here in your response.

Kevin

ADD REPLY • link 7.1 years ago by Kevin Blighe 88k

0

Entering edit mode

Hi Kevin,

I have uploaded two figures.

my commands:

project.pca <- prcomp(t(MyReadCountMatrix))
project.pca.proportionvariances <- ((project.pca$sdev^2) / (sum(project.pca$sdev^2)))*100

barplot(project.pca.proportionvariances, cex.names=1, xlab=paste("Principal component (PC), 1-", length(project.pca$sdev)), ylab="Proportion of variation (%)", main="Scree plot", ylim=c(0,100))

par(cex=1.0, cex.axis=0.8, cex.main=0.8)    
pairs(project.pca$x[,1:5], col="black", main="Principal components analysis bi-plot\nPCs 1-5", pch=16)

ADD REPLY • link updated 6.4 years ago by Ram 44k • written 7.1 years ago by Mehmet ▴ 820

1

Entering edit mode

You need to post the URL's for the figures in your post.

ADD REPLY • link 7.1 years ago by GenoMax 148k

0

Entering edit mode

Sorry,

Please have a look

plot1

ADD REPLY • link updated 6.4 years ago by Ram 44k • written 7.1 years ago by Mehmet ▴ 820

0

Entering edit mode

Looks okay to me. In the second one, the title is incorrect though, as it's PC1-10. You could also make the dots smaller here, but it's not that important as these plots are mainly just for QC.

What's the variance explained on PC1 and 2?

Looks fine.

Edit: as you go further down along the PCs, you always find 'odd' plots like the one for PC10. Nothing to worry about, though. I'd worry if I saw that on PC1-3

ADD REPLY • link 7.1 years ago by Kevin Blighe 88k

0

Entering edit mode

Thank you for help.

I do not know variance between PC1 and PC2. When I run commands, I did not get variance value. I have 17 PCs when I type below:

    Standard deviations:
 [1] 3.49034964 1.22304723 1.05526472 0.99765777 0.76043084 0.56623044 0.37994839
 [8] 0.24585524 0.23052947 0.17215343 0.08918422 0.07063718 0.06521970 0.05890600
[15] 0.04826593 0.04446027 0.03575958

How to interpret the graphs? and what is the meaning of the dots in the boxes? As I have 17 PCs, should I show 17 PCs in the plot?

In some tutorials I saw that people write column names instead of PC1, PC2 .....

Should I use column names also?

There are several type of PCA plots. Which should I use in my paper?

Like this one:

https://www.bioconductor.org/help/workflows/rnaseqGene/#pca-plot

Sorry for asking too many questions.

Thank you

ADD REPLY • link 7.1 years ago by Mehmet ▴ 820

1

Entering edit mode

The PCA bi-plot at https://www.bioconductor.org/help/workflows/rnaseqGene/#pca-plot is generated using the plotPCA function in DESeq2. It is usually used when processing RNA-seq data. However, this function is biased because it removes a large proportion of genes of low variance from the dataset prior to performing PCA, which defeats the purpose of conducting the PCA in the first place. My methods perform a completely unbiased PCA.

If you follow my code, the percentage of overall variance explained by each principal component will be held in project.pca.proportionvariances
Each dot is a sample.
You do not have to plot all PCs from 1-17. Most people only plot PC1 versus PC2. Look at my code at the top of this page under the heading 'Bi-plots'
PCA is primarily used for QC purposes to show that there are no outliers in the dataset. Looking at your PCA plots, there are no outliers. That is all you need to say. It is possible to go a lot more advanced into PCA, but, I would not recommend it unless you fully understand the PCA process

ADD REPLY • link 7.1 years ago by Kevin Blighe 88k

0

Entering edit mode

Hi Kevin sorry for the random question, but would a PCA plot be useful to see the expression of different genotypes to a treatment. I have 5 genotypes and would like to see how the expression of each pair interact in a plot. Thanks

ADD REPLY • link 6.9 years ago by catagui ▴ 40

0

Entering edit mode

Hello, yes, it would be interesting to see that. You have RNA-seq data for these samples?

ADD REPLY • link 6.9 years ago by Kevin Blighe 88k

0

Entering edit mode

Yes I do have 5 genotypes (3 samples per geno) and a control and treatment.

ADD REPLY • link 6.9 years ago by catagui ▴ 40

0

Entering edit mode

Just a quick additional question, I have normalized my counts using DESeq, and then log transformed (log2) my data, but do you do something about the zero's in the datasets when doing the log transformation? Since they will give a infinite negative value. Do you add something to those zero's? Also, do you do any removal of lowly expressed genes, nor not for this kind of analysis? Thanks!!

ADD REPLY • link 6.4 years ago by m.laarman ▴ 10

0

Entering edit mode

Oh, to manage those, you should use the regularised log transformation, the function for which is supplied with DESeq2: https://rdrr.io/bioc/DESeq2/man/rlog.html

To then access the r-logged counts, you use assay():

x <- rlog(dds)
assay(x)

PCA is then performed on these r-logged counts.

ADD REPLY • link 6.4 years ago by Kevin Blighe 88k

0

Entering edit mode

Hi Kevin, I'm sorry to be asking more questions, I'm totally new to this. I'm trying to use the code you sent me, but it's giving me errors:

"rlog() may take a long time with 50 or more samples, vst() is a much faster transformation Error in DESeqDataSet(se, design = design, ignoreRank) : some values in assay are not integers"

I'm wondering if this is based on the way I normalized; I used a script from a colleague from DESeq (not DESeq2), using the following code:

conds <- factor(c(colnames(counts)))
cds <- newCountDataSet(counts, conds)
cds <- estimateSizeFactors(cds)

normalized_counts <- counts(cds, normalized=TRUE)

Do you have an example of code that could get me from raw counts to logged normalized counts to use in PCA?

If you need more info on my samples please let me know. Thanks!

ADD REPLY • link 6.4 years ago by m.laarman ▴ 10

1

Entering edit mode

Oh, that error is common. When the dataset is large, it is indeed better to run vst(), and then you access these again via the assay() function. These are perfectly fine in place of logged counts.

If you're starting from your own data matrix of raw counts (cts), then you just need:

library("DESeq2")
dds <- DESeqDataSetFromMatrix(countData = cts,
                              colData = coldata,
                              design = ~ condition)
dds <- DESeq(dds)

normalized_counts <- counts(dds, normalized=TRUE)

Are you actually testing anything as in differential expression? Here, condition would be, like, the CaseControl variable.

Then, for vst():

vsd <- vst(dds, blind=TRUE)
assay(vsd)

Note that DESeq2 already has its own PCA implementation: https://bioconductor.org/packages/3.7/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#principal-component-plot-of-the-samples

If you have issues running DESeq2 itself, you may consider opening a new question

ADD REPLY • link 6.4 years ago by Kevin Blighe 88k

0

Entering edit mode

Thanks! I'm going to try around with DESeq2 and if I need more help I will open a new question.

Just one more question for now: do you know the MDS plot? I've used it during my analysis in edgeR, and do you know how this compares to a PCA plot? I've understood that also here the distance between the dots in the plot is a measure of distance between the datasets (Euclidean distance) (explanation in R: This function is a variation on the usual multdimensional scaling (or principle coordinate) plot, in that a distance measure particularly appropriate for the microarray context is used. The distance between each pair of samples (columns) is the root-mean-square deviation (Euclidean distance) for the top top genes. Distances on the plot can be interpreted as leading log2-fold-change, meaning the typical (root-mean-square) log2-fold-change between the samples for the genes that distinguish those samples.)

Would you know how this would compare to PCA?

ADD REPLY • link 6.4 years ago by m.laarman ▴ 10

0

Entering edit mode

Sorry for another reply. I've run the code you gave me and it worked, but I still have the same 'problem' now, in that there is no normal distribution in my data, since the large number of genes that had 0 counts before, now just shifted to around 4 (after normalization and log transformation). Therefore, a histogram of all genes of one of my samples looks like this:

histogram distribution of normalized logged counts from sample 1

I think the data except for the large bar of unexpressed genes might have a distribution more similar to a normal distribution, but how to deal with this?

ADD REPLY • link 6.4 years ago by m.laarman ▴ 10

0

Entering edit mode

Hey, that does not look like logged or variance-stabilised data. Can you show the code that you used? Regarding MDS and PCA, the mathematical formula behind each is different, but they both ultimately show relationships between samples / variables. Note that MDS can be performed using the PCA transformation.

You may consider a new question for that, as my time is extremely limited today.

ADD REPLY • link 6.4 years ago by Kevin Blighe 88k

0

Entering edit mode

Hi Kevin,

Thanks! That doesn't sound good.. I've started a new post here: How to normalize count data for PCA in R - something goes wrong in case you would like to comment which I would greatly appreciate. I've posted the code there as well.

Thanks for your help so far!

ADD REPLY • link 6.3 years ago by m.laarman ▴ 10

0

Entering edit mode

Hi Kevin. Thanks for your contributions to this fine thread.

Sorry for my late question but I do not understand why you used DESeq() and counts() functions in your code above. It is my understanding that starting from a matrix of raw counts, simply running the following code will prepare the data for PCA analysis:

library("DESeq2")
dds <- DESeqDataSetFromMatrix(countData = cts,
                              colData = coldata,
                              design = ~ condition)
vsd <- vst(dds)
datapreparedforPCA <- assay(vsd)

and then we will go for:

PCA.res <- prcomp(t(datapreparedforPCA))

or since I used to work with FactoMineR package:

PCA.res <- FactoMineR::PCA(t(datapreparedforPCA), 
                                  scale.unit = FALSE, 
                                  graph = FALSE)

Am I missing something here? Is there something wrong with the codes above?

ADD REPLY • link 3.1 years ago by BioNovice247 ▴ 20

0

Entering edit mode

I want to explore RNAseq data from Expression Atlas (EMBL-EBI database) to get an impression on the similarity/differences of RNA-sequencing samples, i.e. to identify subgroups or outliers. The dataset downloaded from the database was in TPM, so I changed NAs into '0's and multiplied the values by 1E+6 (to get integer values). I want to log transform the data (rlog) but got stuck at:

log.data4pca = rlog(data4pca)

Which results an error: Error in (function (classes, fdef, mtable) : unable to find an inherited method for function ‘sizeFactors’ for signature ‘"data.frame"’

Can anyone help me with that?

Data:

Gene ID sample1 sample2 sample3 sample4 sample5
ENSG00000000003 0.6 65.0    30.0    61.0    62.0
ENSG00000000005     1.0 0.4 0.8 
ENSG00000000419 87.0    83.0    66.0    98.0    46.0
ENSG00000000457 3.0 11.0    9.0 10.0    5.0
ENSG00000000460 6.0 3.0 3.0 3.0 2.0
ENSG00000000938 308.0   6.0 6.0 12.0    6.0
ENSG00000000971 1.0 61.0    31.0    87.0    738.0
ENSG00000001036 38.0    85.0    74.0    28.0    37.0
ENSG00000001084 25.0    24.0    36.0    41.0    78.0
ENSG00000001167 45.0    25.0    19.0    24.0    13.0
ENSG00000001460 1.0 6.0 3.0 3.0 0.7
ENSG00000001461 5.0 20.0    13.0    17.0    1.0
ENSG00000001497 28.0    15.0    12.0    20.0    8.0
ENSG00000001561 71.0    34.0    27.0    16.0    11.0
ENSG00000001617 2.0 10.0    16.0    80.0    2.0
ENSG00000001626     86.0    62.0    1.0 1.0
ENSG00000001629 12.0    31.0    25.0    32.0    11.0
ENSG00000001630 2.0 4.0 9.0 5.0 4.0
ENSG00000001631 9.0 18.0    19.0    21.0    11.0
ENSG00000002016 4.0 9.0 10.0    6.0 2.0
ENSG00000002079                 
ENSG00000002330 18.0    30.0    20.0    10.0    9.0
ENSG00000002549 55.0    80.0    90.0    66.0    114.0
ENSG00000002586 46.0    117.0   99.0    160.0   69.0
ENSG00000002587 1.0 3.0 2.0 7.0 0.3
ENSG00000002726 1.0 257.0   248.0   1.0 0.1
ENSG00000002745 0.1 0.2 0.2 0.5 
ENSG00000002746     0.1 0.1 0.2 
ENSG00000002822 9.0 4.0 4.0 4.0 3.0
ENSG00000002834 123.0   121.0   148.0   51.0    45.0

Script:

library(PCAtools)
library(DESeq2)

#copy data form Excel
data4pca = read.table(pipe("pbpaste"), sep="\t", header=T)
#extract rownames from first column
rownames(data4pca) = data4pca[,1]
#remove first column with geneIDs
data4pca = data4pca[,-1]
# change NAs into '0's
data4pca[is.na(data4pca)] <- 0
#to get integer values (TPM)
data4pca = data4pca * 1000000
# rlog transform data
log.data4pca = rlog(data4pca)
#principal components analysis (PCA) on log-transformed data
pcadata <- pca(log.data4pca)

ADD REPLY • link 5.6 years ago by mschmidt ▴ 80

0

Entering edit mode

You are using my package, PCAtools? - DESeq2 has its own PCA function, plotPCA().

To use the DESeq2 implementation:

DESeq2::plotPCA(log.data4pca)

To use my PCAtools implementation:

pca(assay(log.data4pca))

ADD REPLY • link 5.6 years ago by Kevin Blighe 88k

0

Entering edit mode

Hi Kevin, I am using your package (PCAtools). My problem is that I can't get the data rlog-transformed. Simple log transformation works, but rlog not.

ADD REPLY • link 5.6 years ago by mschmidt ▴ 80

0

Entering edit mode

Oh, I see, have you tried rlog(data.matrix(data4pca)) ?

ADD REPLY • link 5.6 years ago by Kevin Blighe 88k

0

Entering edit mode

Your package (PCAtools) has no rlog

> log.data4pca = rlog(data.matrix(data4pca))
Error in rlog(data.matrix(data4pca)) : could not find function "rlog"

with loaded DESeq2

> log.data4pca = rlog(data.matrix(data4pca))
converting counts to integer mode
Error in validObject(.Object) : 
  invalid class “DESeqDataSet” object: NA values are not allowed in the count matrix
In addition: Warning message:
In mde(x) : NAs introduced by coercion to integer range

its strange as I have all NAs converted to '0's, even

> log.data4pca = rlog(data.matrix(data4pca + 1))

returns the same error

ADD REPLY • link 5.6 years ago by mschmidt ▴ 80

0

Entering edit mode

Yes, rlog is part of DESeq2. You should check the encoding of your object via str(). rlog should be able to accept a data-matrix, which contains only numerical values.

ADD REPLY • link 5.6 years ago by Kevin Blighe 88k

0

Entering edit mode

> str(data4pca)
'data.frame':   45946 obs. of  13 variables:
 $ bone.marrow           : num  6.00e+05 0.00 8.70e+07 3.00e+06 6.00e+06 3.08e+08 1.00e+06 3.80e+07 2.50e+07 4.50e+07 ...
 $ colon                 : num  6.5e+07 1.0e+06 8.3e+07 1.1e+07 3.0e+06 6.0e+06 6.1e+07 8.5e+07 2.4e+07 2.5e+07 ...
 $ duodenum              : num  3.0e+07 4.0e+05 6.6e+07 9.0e+06 3.0e+06 6.0e+06 3.1e+07 7.4e+07 3.6e+07 1.9e+07 ...
 $ esophagus             : num  6.1e+07 8.0e+05 9.8e+07 1.0e+07 3.0e+06 1.2e+07 8.7e+07 2.8e+07 4.1e+07 2.4e+07 ...
 $ liver                 : num  6.20e+07 0.00 4.60e+07 5.00e+06 2.00e+06 6.00e+06 7.38e+08 3.70e+07 7.80e+07 1.30e+07 ...
 $ lymph.node            : num  6.00e+06 1.00e+05 1.32e+08 1.70e+07 7.00e+06 6.30e+07 3.40e+07 3.60e+07 2.60e+07 5.80e+07 ...
 $ rectum                : num  6.70e+07 1.00e+06 8.30e+07 1.30e+07 5.00e+06 7.00e+06 9.80e+07 1.13e+08 2.70e+07 3.30e+07 ...
 $ saliva.secreting.gland: num  5.3e+07 4.0e+05 3.5e+07 6.0e+06 1.0e+06 3.0e+06 1.8e+07 1.7e+07 7.0e+06 1.0e+07 ...
 $ small.intestine       : num  2.2e+07 4.0e+05 7.8e+07 9.0e+06 3.0e+06 1.0e+07 5.3e+07 7.6e+07 2.9e+07 2.3e+07 ...
 $ spleen                : num  1.7e+07 5.0e+05 9.7e+07 1.4e+07 5.0e+06 1.8e+08 1.8e+07 6.3e+07 5.0e+07 3.9e+07 ...
 $ stomach               : num  2.2e+07 0.0 6.3e+07 1.0e+07 2.0e+06 6.0e+06 6.0e+07 7.1e+07 2.7e+07 2.5e+07 ...
 $ tonsil                : num  1.40e+07 3.00e+05 1.14e+08 1.40e+07 1.00e+07 3.60e+07 2.10e+07 4.10e+07 2.30e+07 4.90e+07 ...
 $ vermiform.appendix    : num  1.10e+07 2.00e+06 1.00e+08 1.30e+07 8.00e+06 1.72e+08 6.00e+07 7.00e+07 2.10e+07 4.20e+07 ...

ADD REPLY • link 5.6 years ago by mschmidt ▴ 80

0

Entering edit mode

Just looks like there is still something wrong with your formatting. Look at this example:

Generate random data and impute a `NA`

mat <- round(matrix(rexp(200, rate=.1), ncol=20), 0)
mat[7,9] <- NA

Try with `NA`

require(DESeq2)
rlog(mat)
Error in DESeqDataSet(se, design = design, ignoreRank) : 
  NA values are not allowed in the count matrix

Convert any `NA` to 0 and re-try

mat[is.na(mat)] <- 0
rlog(mat)
converting counts to integer mode
              1        2         3         4        5         6         7         8         9        10        11        12       13         14       15        16        17        18       19       20
 [1,] 1.6303296 2.101263 1.6303296 4.4426980 3.916323 2.5563570 1.6870700 4.2057665 1.4127739 2.7271864 2.5850314 2.5552601 2.286887  2.5540934 2.557778 2.5569548 3.2821198 2.9502635 3.622561 1.703656
 [2,] 2.9635154 2.963067 2.9635154 2.4257258 2.943951 3.9320118 4.1476477 2.0566027 2.9753406 2.0970838 2.3144479 3.6813475 2.965651  3.3690668 4.196746 3.4155491 2.9628123 2.0111253 2.481567 2.962925
 [3,] 2.2500342 3.188301 3.6284485 2.7554526 3.193896 0.4068467 4.4410790 0.6108451 3.7866725 0.9955659 2.8728750 1.1451716 3.620244  1.1661840 1.239904 3.7693159 1.7485206 0.5350876 2.592244 2.284964
 [4,] 4.4256292 3.108021 0.8907872 1.8005166 3.825927 2.8045730 4.6617094 0.7861423 3.1211707 1.7357820 2.7223201 3.1454056 4.921423  1.7119954 1.809768 3.3314205 1.5885089 2.6071480 2.105057 2.414051
 [5,] 4.2780751 4.160892 5.4845108 3.3502959 3.350899 3.2308789 3.3498840 3.3493575 3.3491791 3.3481371 3.3482641 1.8285660 4.849273  1.4041449 2.977995 3.2891094 2.7175965 3.3482017 3.349179 3.489831
 [6,] 1.3250958 2.979303 3.1284451 4.5200420 3.937701 3.9664888 4.0853001 4.0207723 3.4823841 3.1203092 1.3941709 3.5961874 4.379111  3.1715398 3.914777 4.2228263 3.4001319 4.0778840 5.350684 1.630279
 [7,] 1.4466070 1.224425 3.6512460 3.5304046 2.286591 1.9162281 1.5021949 3.6600331 0.5854978 2.3790455 0.6514361 2.5760491 2.643559  1.6948982 2.201923 2.0457115 2.8874726 1.1159874 3.051115 3.568114
 [8,] 1.9664247 4.327562 0.7690776 3.9564714 4.720774 2.1731031 3.6682082 1.3603538 3.2485642 2.5102086 2.0997103 2.6595941 3.671308  1.7433200 2.481331 0.5302784 0.4837551 2.9761034 2.263407 1.177239
 [9,] 2.6268004 3.554450 3.7418866 0.9981525 1.113274 3.4400201 3.6312819 1.8713298 1.3540877 2.4580801 0.2464204 3.5573662 5.202193  0.8450194 2.100423 3.6015767 2.9490219 1.9511093 3.711696 3.622012
[10,] 0.5877168 2.518681 2.1797479 1.5528766 3.777592 2.8244428 0.6276905 3.2713734 2.6839627 0.9035905 1.6319140 0.6812501 4.749080 -0.1144293 2.163285 3.9095113 1.6715744 1.2102973 1.896780 3.886975
attr(,"betaPriorVar")
[1] 4.796553
attr(,"intercept")
          [,1]
 [1,] 2.648234
 [2,] 2.991484
 [3,] 2.311582
 [4,] 2.675867
 [5,] 3.392713
 [6,] 3.485171
 [7,] 2.230926
 [8,] 2.439339
 [9,] 2.628809
[10,] 2.130695

To check if there is a NA anywhere in your data, do:

anyis.na(mat) == TRUE)

It returns TRUE if there is a NA present.

ADD REPLY • link 5.6 years ago by Kevin Blighe 88k

1

Entering edit mode

Dear Kevin, all the problems I had was probably because rlog needs DESeqDataSet (rlog(dds)), and I tried to generate dds from the TPM matrix which I multiplied by 1e+6. This produced too high counts making the error.

Thank you very much for your help! Best regards, Marcin

ADD REPLY • link 5.4 years ago by mschmidt ▴ 80

0

Entering edit mode

Hi Kevin, I am a new one. Would you please explain how to set legends with color in biplot? My data: each column is a sample, each row is a gene. Thank you very much.

ADD REPLY • link 6.4 years ago by linjc.xmu ▴ 30

0

Entering edit mode

Hello, you can add a custom legend to any plo using something like:

legend("bottomright", bty="n", cex=0.8, title="MyFirstLegend", c("Chinese", "Iraqis", "Iranians", "Syrians", "Russians"), fill=c("red1", "black", "forestgreen", "red2", "darkred"))

The first parameter can be any of "bottomright", "bottom", "bottomleft", "left", "topleft", "top", "topright", "right", or "center"

Kevin

ADD REPLY • link 6.4 years ago by Kevin Blighe 88k

1

Entering edit mode

Thank you very much!

ADD REPLY • link 6.4 years ago by linjc.xmu ▴ 30

0

Entering edit mode

How can I put the names of the samples in this case? And different colors for it group? Because in the pipeline there are just the basic information to plot a simple graph.

Tks

ADD REPLY • link 6.4 years ago by silas008 ▴ 170

0

Entering edit mode

@kevin Blighe In your PCAtools package. Do we need normalized count as an input or raw count and how variance is calculated? Do we need to perform rlog type transformation before performing the PCA?

ADD REPLY • link 4.3 years ago by Thind amarinder ▴ 340

View data

Create colour vector and shape vector for leaf / fruit

Perform PCA and plot PC1 vs. PC2

-------------------------

Perform PCA / single value decomposition

Scree plot

Pairs plots

Bi-plots

Tri-plot

Generate random data and impute a NA

Try with NA

Convert any NA to 0 and re-try

Generate random data and impute a `NA`

Try with `NA`

Convert any `NA` to 0 and re-try