HI, I want to find differential expressed genes in TCGA datasets by edgeR. I used RNA-seq data measured by RPKM as input. The below code is basically followed edgeR manual section4.2 (https://bioconductor.org/packages/release/bioc/vignettes/edgeR/inst/doc/edgeRUsersGuide.pdf) combined with a post on Biostars (https://www.biostars.org/p/212200/).
#reading files
file1="cancer.exprs.txt"
file2="normal.exprs.txt"
cancer.exprs = read.table(file = file1,header=T,sep="\t")
normal.exprs = read.table(file = file2,header=T,sep="\t")
#combine cancer and normal expression data and filter by median of expression value
exprs=cbind(cancer.exprs,normal.exprs)
rownames(exprs) = rownames(cancer.exprs)
median = apply(exprs,1,median)
exprs.filter = exprs[which(median >= 0.1),]
#put the expression data into edgeR object
y = DGEList(counts = exprs.filter, genes = rownames(exprs.filter))
#construct design model
Tissue = factor(c(rep("T",dim(cancer.exprs)[2]),rep("N",dim(normal.exprs)[2])))
data.frame(Sample = colnames(y),Tissue)
design = model.matrix(~0+Tissue)
#compute differential expressed genes
y = estimateDisp(y, design, robust=TRUE)
y$common.dispersion
fit = glmFit(y, design)
lrt = glmLRT(fit)
topTags(lrt,n=100)
cancer.exprs.txt and normal.exprs.txt are two files containing expression value measured by RPKM from cancer and normal samples. In these two files each column represents each sample and each row represents each gene. The first column and row are sample id and gene id. After running the code, I find that all genes are up-regulated genes with very huge logFC (I think it is log foldchange). Even I used cancer.exprs.txt as both cancer and normal input file, this code also consider all genes are up-regulated genes with very huge logFC. What is the wrong with me? Thanks
I will try raw counts. But I think it is unreasonable to ban to use RPKM as input. Under this situation, I can do nothing but use raw counts? Is there some ways to use RPKM as input? Thanks
RPKM has not been "banned" as input. edgeR uses a negative binomial distribution to model read counts, because read counts follow reasonably well the negative binomial distribution (whereas RPKMs do not) . So it is unreasonable to use data that do not follow the assumptions of the model, and expect sensible results.
Well, I tried raw counts as input. However the outcome still show that all genes are up-regulated genes. I think it is not because of RPKM (Although using raw counts is correct). It is probably because of my code. Is there something wrong about my code? Thanks.