Question

What can be done to separate the healthy and patient samples in PCA plot and Distance matrix

0

Entering edit mode

5.9 years ago

Mithil Gaikwad ▴ 50

Hello, I am doing RNA seq analysis to obtain Differential expression genes using DESeq2 for 13 patient and 6 Healthy donors. Before going for DESeq2 analysis, I am visualizing my samples by Distance matrix and PCA plot, using the following commands:

library("RColorBrewer")
sampleDistMatrix <- as.matrix(sampleDists)
rownames(sampleDistMatrix) <- paste(rld$condition, sep="-")
colnames(sampleDistMatrix) <- NULL
colors <- colorRampPalette( rev(brewer.pal(9, "Blues")) )(255)
pheatmap(sampleDistMatrix,
         clustering_distance_rows=sampleDists,
         clustering_distance_cols=sampleDists,
         col=colors)

DIstance matrix

plotPCA(rld, intgroup="condition")

: PCA plot

Here I am not able to visualize both the sample group separately. Am I suppose to discard the samples, which can not be separated and how one can find this exact sample? Is there any function or solution to separate the samples without discarding it? Kindly reply. Any help in this regard will be highly appreciated

    sessionInfo() R version 3.5.2 (2018-12-20) Platform: x86_64-pc-linux-gnu (64-bit) Running under: Ubuntu 18.04.1 LTS

Matrix products: default BLAS: /usr/lib/x8664-linux-gnu/blas/libblas.so.3.7.1 LAPACK: /usr/lib/x8664-linux-gnu/lapack/liblapack.so.3.7.1

locale: [1] LCCTYPE=enIN.UTF-8 LCNUMERIC=C
LCTIME=enIN.UTF-8 LCCOLLATE=enIN.UTF-8 [5] LCMONETARY=enIN.UTF-8 LCMESSAGES=enIN.UTF-8
LCPAPER=enIN.UTF-8 LCNAME=C [9] LCADDRESS=C LCTELEPHONE=C
LCMEASUREMENT=enIN.UTF-8 LC_IDENTIFICATION=C

attached base packages: [1] parallel stats4 stats graphics grDevices utils datasets methods base

other attached packages: [1] hexbin1.27.2 vsn3.50.0
pheatmap1.0.12 RColorBrewer1.1-2 [5] DESeq21.22.1 SummarizedExperiment1.12.0 DelayedArray0.8.0 BiocParallel1.16.5 [9] matrixStats0.54.0 Biobase2.42.0
GenomicRanges1.34.0 GenomeInfoDb1.18.1 [13] IRanges2.16.0 S4Vectors0.20.1
BiocGenerics_0.28.0

loaded via a namespace (and not attached): [1] bit640.9-7
splines3.5.2 Formula1.2-3 assertthat0.2.0
affy1.60.0 [6] BiocManager1.30.4 latticeExtra0.6-28 blob1.1.1 GenomeInfoDbData1.2.0 pillar1.3.1
[11] RSQLite2.1.1 backports1.1.3 lattice0.20-38
limma3.38.3 glue1.3.0 [16] digest0.6.18
XVector0.22.0 checkmate1.9.0 colorspace1.4-0
preprocessCore1.44.0 [21] htmltools0.3.6 Matrix1.2-15
plyr1.8.4 XML3.98-1.16 pkgconfig2.0.2
[26] genefilter1.64.0 zlibbioc1.28.0 purrr0.2.5
xtable1.8-3 scales1.0.0 [31] affyio1.52.0
htmlTable1.13.1 tibble2.0.1 annotate1.60.0
ggplot23.1.0 [36] nnet7.3-12 lazyeval0.2.1
survival2.43-3 magrittr1.5 crayon1.3.4
[41] memoise1.1.0 foreign0.8-70 tools3.5.2
data.table1.12.0 stringr1.3.1 [46] locfit1.5-9.1
munsell0.5.0 cluster2.0.7-1 AnnotationDbi1.44.0
bindrcpp0.2.2 [51] compiler3.5.2 rlang0.3.1
grid3.5.2 RCurl1.95-4.11 rstudioapi0.9.0
[56] htmlwidgets1.3 labeling0.3 bitops1.0-6
base64enc0.1-3 gtable0.2.0 [61] DBI1.0.0
R62.3.0 gridExtra2.3 knitr1.21
dplyr0.7.8 [66] bit1.1-14 bindr0.1.1
Hmisc4.1-1 stringi1.2.4 Rcpp1.0.0
[71] geneplotter1.60.0 rpart4.1-13 acepack1.4.1
tidyselect0.2.5 xfun_0.4

RNA-Seq DESeq2 PCA plot Distance matrix • 2.8k views

ADD COMMENT • link 5.9 years ago by Mithil Gaikwad ▴ 50

1

Entering edit mode

Before going for DESeq2 analysis

Does this mean you are directly using raw counts for clustering and PCA? If so, your results are not unusual. Also from the heatmap scale, I really think there is some sort of problem with normalization. You can use DESeq2 rlog/vst normalized counts for PCA.

ADD REPLY • link 5.9 years ago by venu 7.1k

0

Entering edit mode

Actually, I have transformed my raw counts using rlog and then I am visualizing my data in PCA plot and heatmap so that I could obtain proper Differential expressing genes.

ADD REPLY • link 5.9 years ago by Mithil Gaikwad ▴ 50

0

Entering edit mode

These plots that you have shown say virtually nothing about differentially expressed genes. These plots just show [mostly] unbiased / unsupervised relationships between your samples.

It looks like you may benefit from reading through the DESeq2 Vignette

By the way, in your PCA bi-plot, there is some evidence that 2 of your samples are outliers.

ADD REPLY • link 5.9 years ago by Kevin Blighe 88k

0

Entering edit mode

Thank you for your reply, Kevin, I have read DESeq2 vignette but it didn't find anything about solving such kind of error, however, Again i'll read and find it out what can be done. But as per your experience, can you give me any suggestions for solving this problem? Thank you

ADD REPLY • link 5.9 years ago by Mithil Gaikwad ▴ 50

0

Entering edit mode

I see what you mean. You want to identify the samples in the PCA bi-plot so that you can remove them?

Please take a look at this reproducible code (using random data), which will perform PCA and generate a bi-plot with lables (just for QC-ing):

x <- matrix(rexp(200, rate=.1), ncol=20)
rownames(x) <- paste("gene", 1:nrow(x))
colnames(x) <- paste("sample", 1:ncol(x))

p <- prcomp(t(x))

proportionvariances <- ((p$sdev^2) / (sum(p$sdev^2)))*100

#Plots scatter plot for PC 1 and 2
plot(p$x,
  type="n",
  main="Principal components analysis bi-plot",
  xlab=paste("PC1, ", round(proportionvariances[1], 2), "%"),
  ylab=paste("PC2, ", round(proportionvariances[2], 2), "%"))
points(p$x, col="black", pch=16, cex=1) #add points
text(p$x, rownames(p$x), col="black", cex=1) #add labels

Your object is called rld, with the counts in this object being accessible via assay(rld). So, your code needs to be:

p <- prcomp(t(assay(rld)))

proportionvariances <- ((p$sdev^2) / (sum(p$sdev^2)))*100

#Plots scatter plot for PC 1 and 2
plot(p$x,
  type="n",
  main="Principal components analysis bi-plot",
  xlab=paste("PC1, ", round(proportionvariances[1], 2), "%"),
  ylab=paste("PC2, ", round(proportionvariances[2], 2), "%"))
points(p$x, col="black", pch=16, cex=1) #add points
text(p$x, rownames(p$x), col="black", cex=1) #add labels

The resulting bi-plot will differ from your original. This is because DESeq2's PCA function automatically removes a large proportion of variables prior to performing PCA, whereas, my code does not do this.