Question

EdgeR multiple samples

0

Entering edit mode

2.8 years ago

Pahi ▴ 30

Hi!

Recently I'm working on an R pipeline for RNA-seq analysis. My goal is to create a script that compares my samples to each other and compute the DE.

So far I created my script and its working well. But when I implemented loops which would make my life easier, I noticed that the result are different. To be precise: First I did the comparisons sample pair by pair, than with the loops I loaded in all my data and made the comparison sample pair by pair.

What I noticed is everything goes as it should be, than when I reach the part where I call the

fit <- glmQLFit(dge,dge$design,robust = TRUE)

function, the a little difference shows up:

enter image description here

The first pic shows when ALL the samples loaded in, the second is when only 2 group (6 samples) is loaded in. After that when I want to make a

res <- glmQLFTest(fit,contrast = c(-1, 0,1,0))

it obviously shows different result. Edger uses different calculations based on the loaded samples? Also it makes the DE totally different when I try to visualize it. For the 2 example:

enter image description here

One plot shows a lot of DE genes (this is where only 2 groups (6 samples) were compared), and the other only shows 2 (this is where all the samples processed together).

My code looks the same even if I use 2 group samples or multiple group samples.

What could be the problem? Thank you for your time and help!

EdgeR R • 1.1k views

ADD COMMENT • link updated 2.8 years ago by seidel 11k • written 2.8 years ago by Pahi ▴ 30

2

Entering edit mode

If I understand correctly, your question is basically that you get a slightly different result if you calculate DE using all your data versus subsets of your data. The reason the results will not be identical is because the variance estimate for a gene will be much more informed in a larger data set, across more conditions or samples, than in a smaller data set. It's not so much a problem as a feature. Better estimates of variance allow better evaluation of significance.

ADD REPLY • link 2.8 years ago by seidel 11k

0

Entering edit mode

My code :

group <- target$group

reads <- CTRL_INJURED_vs_CTRL_UNINJURED 
gene_name <- rownames(reads)

dge <- DGEList(counts=reads, genes = gene_name, group = group) 
dge$design <- model.matrix(~0+group, data=dge$samples)

dge$genes$ENTREZID <- mapIds(org.Dm.eg.db, dge$genes$genes, keytype = "ENSEMBL",column="ENTREZID") 
dge$genes$Symbol <- mapIds(org.Dm.eg.db, dge$genes$genes, keytype = "ENSEMBL", column="SYMBOL")


dge <- dge[!is.na(dge$genes$ENTREZID),] 
dge <- dge[!is.na(dge$genes$Symbol),] 

keep_dge <- filterByExpr(dge,dge$design) 
table(keep_dge) 
dge <- dge[keep_dge,,keep.lib.sizes=FALSE]

dge <- calcNormFactors(dge) 
dge <- estimateDisp(dge)

dge_df <- as.data.frame(dge)

fit2 <- glmQLFit(dge,dge$design,robust = TRUE) 
head(fit$coefficients)

colsToTest <- grep(varInt,colnames(fit2$design)) 
namesToTest <- paste(gsub(varInt,"",colnames(fit2$design)[colsToTest]),"_vs_",condRef)


results <- list() 
lrt_list <- list()

if (length(colsToTest)>=2){   colnames <- gsub(varInt,"",colnames(fit2$design))   for (comp in combn(length(colsToTest),2,simplify=FALSE)){
    contrast <- numeric(ncol(dge$design))
    contrast[colsToTest[comp[1:2]]] <- c(-1,1)
    namecomp <- paste0(colnames[colsToTest[comp[2]]],"_vs_",colnames[colsToTest[comp[1]]])
    cat(paste0("Comparison ",gsub("_"," ",namecomp),": testing contrast (",paste(contrast,collapse=", "),")"),"\n")
    lrt <- glmQLFTest(fit2, contrast=contrast)
    results[[namecomp]] <- topTags(lrt,n=nrow(dge$counts))$table
    lrt_list[[namecomp]] <- lrt      } }

ADD REPLY • link 2.8 years ago by Pahi ▴ 30