DEG analysis with limma and contrast matrix using multiple Parkinson's cohorts in whole blood: is it normal to always get NS?
3
0
Entering edit mode
7 months ago
egascon ▴ 60

Hello,

I am writing to ask for your help regarding the DEG analysis of several Parkinson's cohorts.

I have already done several transcriptomics and DEG analyses with other diseases. However, when I analyse the GSE18838 cohort, I get the following:

Down 0
NotSig 22010
Up 1

And in other cohorts of the same tissue and Parkinson's vs Control patients from other experiments, I also get NS in all of them. I understand that it is a possible result but I find it strange when there is a published paper that has also done the DEG analysis with the same cohort (GSE18838) and the DEGs come out.

So I wanted to ask you if you could check the code in case you see something strange or if I have done something wrong. Here are all the steps I have done with GSE18838 cohort

**1. I analyse the metadata**

GSE18838 <- getGEO('GSE18838')

length(GSE18838)

class(GSE18838[[1]])

GSE18838 <- GSE18838[[1]]

varLabels(GSE18838) #see labels

info <- pData(GSE18838) ## print the sample information


**2. Processing raw files and associating metadata information**

getGEOSuppFiles("GSE18838")

untar("GSE18838/GSE18838_RAW.tar", exdir = 'data/')

celfilelist <- list.files(path = "data/", pattern = ".CEL", full.names = TRUE)

gse <- read.celfiles(filenames = celfilelist, phenoData=phenoData(GSE18838))


**3. Review and normalize data**

summary(exprs(gse))

boxplot(exprs(gse), outline=FALSE, ylab = "Expression level", xlab = "Samples", col = "red")

abline(h=median(exprs(gse)),col="blue")

plotDensities(exprs(gse), legend = FALSE)

hist(exprs(gse), col="blue", border="white", breaks=100, main = "Histogram of raw expression")

normalized.data <- oligo::rma(gse)

boxplot(normalized.data, outline=FALSE, ylab = "Expression level", xlab = "Samples", col = "red")

abline(h=median(exprs(normalized.data)),col="blue")

plotDensities(exprs(normalized.data), legend = FALSE)

Here you can see how the data ranges from a minimum value of 20 to a maximum of approximately 60,000. And the median does not coincide between the boxes, so we normalise by RMA. So far everything seems to make sense.

**4. DEG analysis:**

pData(normalized.data) #We are interested in making disease vs control -> we select the disease state:ch1 tag to make the layout.

experimental.design <- model.matrix( ~ 0 + (normalized.data)[['disease state:ch1']])

colnames(experimental.design) <- levels(as.factor(normalized.data[['disease state:ch1']]))

colnames(experimental.design) <- c("Healthy", "PD")

contrast_matrix <- makeContrasts(PD - Healthy, levels=experimental.design)

contrast_matrix

linear.fit <- lmFit(normalized.data,experimental.design)

contrast.linear.fit <- contrasts.fit(linear.fit,contrasts=contrast_matrix)

contrast.results <- eBayes(contrast.linear.fit)

summary(decideTests(contrast.results,lfc=0))  #lfc=0 to see all genes with p.value < 0.05

The results that come out according to summary:

PD - Healthy
Down 0
NotSig 22010
Up 1

And the volcanoplot looks like this:

enter image description here

And I have applied this same coding to three other full blood Parkinson's cohorts doing a disease vs. control comparison. Is it possible that I have something wrong or that something strange is going on?

According to this paper:

Xing N, Dong Z, Wu Q, Zhang Y, Kan P, Han Y, Cheng X, Wang Y, Zhang B. Identification of ferroptosis related biomarkers and immune infiltration in Parkinson's disease by integrated bioinformatic analysis. BMC Med Genomics. 2023 Mar 14;16(1):55. doi: 10.1186/s12920-023-01481-3. PMID: 36918862; PMCID: PMC10012699.

SDRs should be released.

Thank you very much for your help.

DEG limma rstudio • 1.9k views
ADD COMMENT
0
Entering edit mode

Looking at this briefly, it is small sample size and you do not correct for any potential confounders. Use PCA or MDS to explore the data. Are there outliers, batch effects, anything that needs correction? Or are there simply do DEGs? It's possible.

ADD REPLY
0
Entering edit mode

I have checked the quality of the arrays using PCA and Boxplot (images attached) and some of the samples may deviate a little but perhaps not too much.

PCA

enter image description here

Boxplot

enter image description here

Maybe there is no DEG and that is a possible result but I find it strange because what I get does not agree with other analyses of other fellow researchers such as the paper by Xing et al. 2023, which also analyses the same spring with a result of 399 DEGs. Is it possible that this was a mistake on the part of the researchers?

ADD REPLY
0
Entering edit mode

Wait, so the magnitudes of the PCs are in the thousands? That's not normal for RNASeq.

ADD REPLY
0
Entering edit mode

It's a microarryay experiment, not RNASeq

ADD REPLY
0
Entering edit mode

Would Parkinson's be expected to have an effect on blood cells? Neurons, sure, but blood cells?

ADD REPLY
0
Entering edit mode

This is true. However, there are a number of studies that have done bioinformatics analysis in whole blood to see if there is a possibility of finding biomarkers.It makes sense that are no DEGs. The advantage is that at the diagnostic level it is less invasive. But it's strange because these paper show several cohorts with more o less DEGs. And yet, using the same packages (limma, p-value adjusted with FDR), I don't get the same results. So, it makes me doubt if I am doing it well.

ADD REPLY
1
Entering edit mode
7 months ago
ATpoint 86k

I think the issue is that there is some contamination of groups. If you plot a PCA or MDS you see that there is a clear separation between groups but with some contamination. THe use of empirical weights can compensate for this. Here is a minimal version of your code, using arrayWeights() before lmFit, resulting in lots of DEGs.

options(timeout = 9999999)
library(GEOquery)
library(limma)
library(ggplot2)
library(oligo)
library(pd.huex.1.0.st.v2)

GSE18838 <- getGEO('GSE18838')
GSE18838 <- GSE18838[[1]]
getGEOSuppFiles("GSE18838")
untar("GSE18838/GSE18838_RAW.tar", exdir = 'data/')
celfilelist <- list.files(path = "data/", pattern = ".CEL", full.names = TRUE)

gse <- read.celfiles(filenames = celfilelist, phenoData=phenoData(GSE18838))
y <- oligo::rma(gse)

# FALSE/non-Parkinson is the reference so its Parkinson - COntrol
y$diseased <- factor(ifelse(grepl("none", y$`disease state:ch1`), FALSE, TRUE))
design <- model.matrix(~diseased, data.frame(pData(y)))

library(limma)
library(ggplot2)

# This suggests some outliers or wrongly assigned labels as clear group separation with some contamination
plotMDS(exprs(y), labels=y$diseased, pch=20)

# Compensate outliers with empirical weights
aw <- arrayWeights(exprs(y), design)
fit <- lmFit(exprs(y), design, weights = aw)
fit <- eBayes(fit)

tt <- topTable(fit, coef=2, number = Inf)

# Lots of DEGs...
nrow(tt[tt$adj.P.Val < 0.05 & abs(tt$logFC) > .5,])
> 1830
ADD COMMENT
1
Entering edit mode

There has been some spam bots in this thread which we cleaned up, you might still get notifications on it, just ignore.

ADD REPLY
0
Entering edit mode

Thank you very much!

I have repeated the code adding arrayWeights() and it's works! In this case, I have a desing without intercept and contrast matrix.

I have curiosity, why do you make a desing with interceptor and without contrast matrix?

Thank you again

ADD REPLY
2
Entering edit mode

I have curiosity, why do you make a desing with interceptor and without contrast matrix?

Because for a 2-group comparison you do not need it. One level of the group (here the non-disease) becomes the intercept, so the second one (coef=2) is the contrast that is tested. Contrasts and no-intercept designs become handy when you have more complex comparisons. Still, you would get the exact same results if you encode above analysis with a no-intercept and contrast matrix. It's just a different way of encoding the analysis.

ADD REPLY
0
Entering edit mode

Thank you very much again. I have learned a lot of things.

ADD REPLY
0
Entering edit mode
6 months ago
Gordon Smyth ★ 7.7k

It is very common for studies like this to show no significant DE. Human studies like this where comparisons are made between diseases and non-diseased groups need very large samples sizes to get reliable results.

If you read the paper by Xing et al (2023) closely, you'll see that they actually didn't have any DE by generally accepted standards. They didn't do any adjustment for multiple testing, and simply took genes with P-value < 0.05 to be DE instead of the generally accepted standard of FDR < 0.05 or FDR < 0.1. Some of their reported results may be false discoveries.

You can increase your power to detect DE genes by using array weights (as you have done) and by setting robust=TRUE and trend=TRUE when running eBayes. I recommend array weights as a routine procedure with human data.

ADD COMMENT

Login before adding your answer.

Traffic: 1975 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6