Read In Data

Question

edgeR RNAseq getting no differentially expressed genes when l compare my groups

0

Entering edit mode

6.7 years ago

mbanoian • 0

Hi All, l am at the beginning of my RNAseq journey and l have bumped into my first major snag.

I have adopted the methods that were presented in the publication : RNA-seq analysis is easy as 1-2-3 with limma, Glimma and edgeR [version 1; referees: 3 approved].

I have (3)control, (7)TBHIV, (4)TB,(5)HIV patient tissue.

I ran the script below and l keep on getting no differentially expressed genes when l perform a decideTests. :

RNAseq script

rawdata <- read.table("SteynRawCounts_Filtered_outliersremoved.txt", header=TRUE, sep="")
colnames(rawdata) <- c("Genes", rep("TBHIV",4),"HIV",rep("TB",2),"TBHIV",rep("HIV",2),"TBHIV","HIV","TB","HIV","TBHIV","TB",rep("Control",3))
library(ggplot2)
library(grid)
library(gridExtra)
library(edgeR)
y <- DGEList(counts=rawdata[,2:20], genes=rawdata[,1])
lcpm <- cpm(y, log=TRUE)
thresh <- cpm(y) > 0.5
keep <- rowSums(thresh) >= 4
y <- y[keep,]
plotMDS(y)

# recalculate library size and calculate normalization factors
y$samples$lib.size <- colSums(y$counts)
y <- calcNormFactors(y, method="TMM")
y <- estimateDisp(y, design, robust=TRUE)
plotMDS(y)

#Preparing for PCA and removing counts with zero variance
x <- y$counts
vars <- apply(y$counts, 1, var)
keep <- vars > 0
x <- x[keep,]
x <- cpm(x, log=T, prior.count=3)
samplenames <- as.factor(c(rep("TBHIV",4),"HIV",rep("TB",2),"TBHIV",rep("HIV",2),"TBHIV","HIV","TB","HIV","TBHIV","TB",rep("Control",3)))
colnames(x) <- samplenames

## pca
pca <- prcomp(t(x), scale.=TRUE)
plot(pca$x[,1], pca$x[,2], pch=19, col=samplenames)
#prepare the model design
diagnosis <- factor(c(rep("TBHIV",4),"HIV",rep("TB",2),"TBHIV",rep("HIV",2),"TBHIV","HIV","TB","HIV","TBHIV","TB",rep("Control",3)))
pathology <- factor(c("A","C","A","C","A","A","C","A","A","C","C","A","C","C","B","A","C","C","C"))
design <- model.matrix(~0+diagnosis)
colnames(design) <- gsub("diagnosis","",colnames(design))

contr.matrix <- makeContrasts(
  ControlvsHIV = HIV-Control,
  ControlvsTB = TB-Control,
  ControlvsTBHIV = TBHIV-Control,
  TBHIVvsHIV = HIV-TBHIV,
  TBHIVvsTB = TB-TBHIV,
  HIVvsTB = TB-HIV,
  levels = colnames(design))

v <- voom(y, design, plot=TRUE)
vfit <- lmFit(v,design)
vfit <- contrasts.fit(vfit, contrasts=contr.matrix)
efit <- eBayes(vfit)
plotSA(efit)
summary(decideTests(efit))

thank you for any pointers.

RNA-Seq R • 2.8k views

ADD COMMENT • link updated 6.7 years ago by Devon Ryan 104k • written 6.7 years ago by mbanoian • 0

0

Entering edit mode

could you reformat that and keep to the relevant bits please; there seems to be multiple contrast matrices for example

ADD REPLY • link 6.7 years ago by russhh 5.7k

0

Entering edit mode

Hi russhh, I have re-formatted it just now. It's just a large block of code, as you can see. I don't have time to look over it in detail right now.

ADD REPLY • link 6.7 years ago by Kevin Blighe 88k

0

Entering edit mode

Could you confirm how and why you removed outliers and filtered the samples when you set the dataset up, please.

ADD REPLY • link 6.7 years ago by russhh 5.7k

0

Entering edit mode

Hi rush, l think the outliers were removed because we were not happy with the quality of the data. Sorry for the multiple contrasts, l will sort that out now.

ADD REPLY • link 6.7 years ago by mbanoian • 0

0

Entering edit mode

Do you have cross-contrast significance for any of your features? That is, if you run topTable() on efit without specifying a particular contrast do you have any hits at adj.P < your threshold. To my mind, you've jumped into single-contrast tests too early. (aside: you might want to add library(limma) at the start of your script). If you get no hits, or very few hits, could you plot logFC versus AveCPM for each of your contrasts and see if there's any systematic biases that are wiping out any hint of differential expression.

ADD REPLY • link 6.7 years ago by russhh 5.7k

0

Entering edit mode

Hi rush (sorry my autocorrect keeps changing your name). Thank for you responses. I have been doing a bit of backtracking and l have noticed that the issue might be to do with the way l annotated the genes in DGEList.

The paper l am emulating used this code to annotate the genes: library(Mus.musculus) geneid <- rownames(x) genes <- select(Mus.musculus, keys=geneid, columns=c("SYMBOL", "TXCHROM"),head(genes),keytype="ENTREZID") x$genes <- genes

when l tried to do this it is using the initial gene lists (before removal of under expressed genes) i.e saying that genes has 56355 opposed to 23154 after processing. This is the code l used.

y1 <- y[keep,] genes <- rawdata_new[,1] y1$genes <- genes y1

ADD REPLY • link 6.7 years ago by mbanoian • 0

0

Entering edit mode

Hi rush, l performed a fit using this code:

Read In Data

rawdata <- read.table("Trimmed Groups HIVTB comparison1.txt", header=TRUE, sep="") names(rawdata) group <- as.factor(c(rep("TBHIV",5),"TB","TB","TBHIV",rep("TB",4))) design <- model.matrix(~0+group) colnames(design) <- gsub("group", "", colnames(design)) design contr.matrix <- makeContrasts(TB.vs.TBHIV = TB-TBHIV, levels = colnames(design)) y <- DGEList(counts=rawdata[,2:13], genes=rawdata[,1],group=group) cpm <- cpm(y) lcpm <- cpm(y, log=TRUE) thresh <- cpm(y) > 0.5 keep <- rowSums(thresh) >= 2 dim(y) y <- y[keep,] dim(y) y <- calcNormFactors(y)

y <- estimateGLMCommonDisp(y,design) names(y) y$common.dispersion sqrt(y$common.dispersion) y <- estimateGLMTrendedDisp(y,design) summary(y$trended.dispersion) y <- estimateGLMTagwiseDisp(y,design) y$prior.df getPriorN(y,design) head(y$tagwise.dispersion) summary(y$tagwise.dispersion)

Generalized Linear Model Fit

glmfit.tgw <- glmFit(y,design,contrast=contr.matrix) lrt.tgw <- glmLRT(glmfit.tgw) topTags(lrt.tgw) summary(decideTestsDGE(lrt.tgw))

this was the out l got:

Coefficient: TBHIV genes logFC logCPM LR PValue FDR 53 KDM1A -14.38859 5.587484 1135.152 7.566662e-249 1.429321e-244 11952 DCTN5 -14.38519 5.657110 1134.195 1.221537e-248 1.429321e-244 7781 C1RL -13.99384 6.045328 1128.796 1.820943e-247 1.420457e-243 2485 CSNK2A1 -14.33314 5.615519 1127.431 3.604872e-247 2.109030e-243 5134 SPOP -13.88495 6.002822 1126.798 4.948748e-247 2.316212e-243 7181 EIF4E2 -14.45320 5.537715 1125.427 9.831066e-247 3.640486e-243 3045 URI1 -14.38462 5.569152 1125.222 1.088941e-246 3.640486e-243 12111 ZNF641 -14.53484 5.405809 1124.649 1.450661e-246 4.243546e-243 12221 SRP68 -14.21717 5.722162 1124.050 1.957926e-246 5.091043e-243 7402 UBAP2 -14.13349 5.793155 1123.103 3.145033e-246 7.360007e-243

but still no DE genes:

summary(decideTestsDGE(lrt.tgw)) TBHIV -1 23402 0 0 1 0

I had trimmed down my groups, so l am only comparing TB to TBHIV patients.

ADD REPLY • link 6.7 years ago by mbanoian • 0