HI,
I have 20 samples of leishmaniasis vaccine that all of them belong to a low dose, with four different time courses, which each has five replicate. i prepare a count table which contains gene names, and sample names with count numbers. I intend to make the identification of DE genes using a log2 fold change and likelihood ratios (LR) Test in edgeR and significantly expressed genes had an FDR adjusted P-value of < 5%. Since i am new to EdgeR, i want to know if i am following the right path or not? because in the end number of the DE genes that i get is 10.
Thanks in advance
library(edgeR)
setwd("~/Desktop/counts/low")
x_low<-read.csv("counts.txt")
time<-factor(c("pre","2hr-post","24hr-post","14d-post",
"pre","2hr-post","24hr-post","14d-post",
"pre","2hr-post","24hr-post","14d-post",
"pre","2hr-post","24hr-post","14d-post",
"pre","2hr-post","24hr-post","14d-post"))
time <- relevel(time, ref="pre")
y_low<-DGEList(counts=x_low[,2:21], genes = x_low[,1], group = time) #read counts.csv and make table
keep <- filterByExpr(y_low)
y_low <-y_low[keep, keep.lib.sizes=FALSE]
y_low <- calcNormFactors(y_low)
plotMDS(y_low)
data.frame(sample=colnames(y_low),time) #data frame
sample time
1 Sample_1 pre
2 Sample_2 2hr-post
3 Sample_3 24hr-post
4 Sample_4 14d-post
5 Sample_5 pre
6 Sample_6 2hr-post
7 Sample_7 24hr-post
8 Sample_8 14d-post
9 Sample_9 pre
10 Sample_10 2hr-post
11 Sample_11 24hr-post
12 Sample_12 14d-post
13 Sample_13 pre
14 Sample_14 2hr-post
15 Sample_15 24hr-post
16 Sample_16 14d-post
17 Sample_17 pre
18 Sample_18 2hr-post
19 Sample_19 24hr-post
20 Sample_20 14d-post
design<-model.matrix(~time)
y_low<-estimateDisp(y_low,design)
fit<-glmFit(y_low,design)
lrt<-glmLRT(fit)
deg = topTags(lrt,p.value = 0.05)$table
the result :
genes logFC logCPM LR PValue FDR
29374 ENSG00000227740 -20.78082 -0.06178923 1898.767 0 0
29967 ENSG00000228601 -20.70710 0.01400828 4597.612 0 0
3640 ENSG00000108924 -20.66524 0.04755844 3114.040 0 0
5382 ENSG00000123119 -20.65702 0.10512820 1486.161 0 0
2370 ENSG00000100600 -20.64387 0.21416278 1731.573 0 0
43848 ENSG00000252756 -20.62157 0.07501494 3396.423 0 0
29983 ENSG00000228624 -20.61890 0.10010055 1563.587 0 0
31399 ENSG00000230583 -20.61545 0.08888092 1783.825 0 0
52898 ENSG00000265784 -20.57370 0.24883706 17582.995 0 0
54539 ENSG00000267802 -20.53795 0.11711300 1954.292 0 0
Firstly thanks for your helpful response. I aim to design a test for pre vs. post to get DE genes. i got the filtering part Gordon, but about LRT, i thought that i compare pre vs. post samples when i did
lrt<-glmLRT(fit, coef = 2:4)
, would you please explain more about that. i appreciate that.About the asking question on platforms, i will follow next time Bioconductor.
There are few issues here, and I have also edited my answer above.
lrt <- glmLRT(fit, coef = 2:4) compares everything to everything, because you only have 4 time-points and hence only 3 df for comparisons.
Your code immediately overwrites the above with lrt <- glmLRT(fit) so you haven't stored the above result anyway. glmLRT(fit) tests something different, as I now explain in my answer above.
Thanks, dear Gordon . sorry for the late response, i catch the concept, and i will follow the commands as you told.
Dear Gordon, for a week i try to tackle the problems of my script as you told, but with changes that you mentioned i neither could not get the DEG or maybe i did not follow the correct pathway. i plot my samples by
plotMDS(y_low)
as you advised. The results are as you see in the Image. I changed the script above and added the options as you suggested. i would like to know the problem is in the designing of my matrix, or the option that i set in glmLRT is wrong? Thanks in advanceAs I predicted in my answer, the MDS plot makes it immediately obvious why you have no DE.
The MDS plot shows a massive batch effect associated with samples 5 to 8 that you haven't accounted for. Obviously that would prevent you finding any DE because the batch effect is so much larger than the effects you are testing for.
It is seems clear from the MDS plot that each group of four samples represents a separate batch. That is an aspect of the experimental design that you need to consider and need to include in the design matrix. The biologists who generated the data should be able to explain to you how the experimental procedure worked. I already nudged you to think about these sort of issues and highlighted the possibility of batch effects in my original answer.