Dear Sir/madam,
I am working on LUAD data of gene expression, this data is RNA-SeqV2 RSEM normalized data. I want to find out the DEG between Tumor stage I and Stage IV. But im facing some problems, I'm using EBSeq package to find the DE genes, Below is the R code i have used (followed according to the EBSeq vignette)
rna <- read.table("RNA/LUAD_RNA_seq.txt",nrows = 20533, header = T, row.names = 1, sep = "\t")
#removing zero expressions
rem <- function(x){
x <- as.matrix(x)
x<- t(apply(x,1,as.numeric))
r <- as.numeric(apply(x,1,function(i)sum(i == 0)))
remove <- which(r > dim(x)[2]*0.5)
return(remove)
}
remove <- rem(rna)
rna <- rna[-remove,]
dim(rna)
x <- t(apply(rna,1,as.numeric))
dim(x)
#removing probes without symbol
row.has.na <- apply(x, 1, function(x){anyis.na(x))})
sumrow.has.na)
x <- x[!row.has.na,]
dim(x)
colnames(x) <- gsub('\\.','-',substr(colnames(rna),1,28))
head(x)
library(EBSeq)
str(x)
#read files of tumor stage
library(xlsx)
Tumor_stage <- read.xlsx(file = "Clinical_merged_picked/Tumor_stage_data.xlsx", sheetIndex = 1)
colnames(Tumor_stage)
Tumor_stage$Stage_I_Barcode <- toupper(Tumor_stage$Stage_I_Barcode)
Tumor_stage$Stage_IV_Barcode <- toupper(Tumor_stage$Stage_IV_Barcode)
#select only tumor samples
Tumor_Samples <- subset(x, select = substr(colnames(x),14,14) == '0')
dim(Tumor_Samples)
#subset the stages (I and IV) of tumor samples from Tumor_samples
Stage_I <- subset(Tumor_Samples, select = substr(colnames(Tumor_Samples),1,12) %in% Tumor_stage$Stage_I_Barcode)
dim(Stage_I)
Stage_IV <- subset(Tumor_Samples, select = substr(colnames(Tumor_Samples),1,12) %in% Tumor_stage$Stage_IV_Barcode)
dim(Stage_IV)
#remove duplicate colnames
Stage_I <- Stage_I[, !duplicated(colnames(Stage_I))]
Stage_IV <- Stage_IV[, !duplicated(colnames(Stage_IV))]
stage_d126samples <- cbind(Stage_I[,1:100], Stage_IV[,1:26])
dim(stage_d126samples)
#library size factor
#1. by median normalization approach,
size_d126 = MedianNorm(stage_d126samples)
## running EBSeq on gene expression estimates
## the function EBTest is used to detect the DE genes
EBOut_d126 = EBTest(Data = stage_d126samples, Conditions = as.factor(c(rep("stage_i",100),rep("stage_iv",26))), sizeFactors = size_d126, maxround = 5)
EBDE_d126 = GetDEResults(EBOut_d126, FDR = 0.05)
EBDE_d126$DEfound
GeneFC_d126 = PostFC(EBOut_d126)
PlotPostVsRawFC(EBOut_d126,GeneFC_d126)
par(mfrow = c(1,2))
QQP(EBOut_d126)
par(mfrow = c(1,2))
DenNHist(EBOut_d126)
EBOut_d$Alpha
EBOut_d$Beta
#subsetting DEG from gene expression data
idx_126 <- EBDE_d126$DEfound
eset <- EBOut_d126$DataNorm[idx_126,]
dim(eset)
head(eset)
#row clustering
hr <- hclust(as.dist(1-cor(t(eset), method="pearson")),method="complete")
par(mfrow = c(1,1))
plot(as.dendrogram(hr))
colnames(eset) <- c(rep("stage_i",100),rep("stage_iv",26))
hc <- hclust(as.dist(1-cor(eset, method="spearman")), method="complete")
plot(as.dendrogram(hc))
library(gplots)
heatmap.2(eset,main="Hierarchical Cluster",Rowv=as.dendrogram(hr), Colv=NA, dendrogram="row", scale="row", col=colorpanel(75,"green","black","red"), density.info="none", trace="none")
here are the images of the plots generated,
As in my heatmap, I can not see any differentiating pattern
1) Please explain me if I'm doing anything or using any command in wrong.
I know i have used normalized counts as a input to EBSeq, but i'm not so sure if i have done the right thing.
I have referred to one question answers (FAQ) link about EBSeq, where i have found this, In general, it is not appropriate to perform cross sample comparisons using TPM, FPKM or RPKM without further normalization. Instead, you may use normalized counts (It can be generated by GetNormalizedMat() function from raw count, note EBSeq testing functions takes raw counts and library size factors)
2) So according to above statement, i can give normalized counts to EBSeq ? if my understanding is not correct then please tell me. It will surely clear my doubts.
Your suggestions will make me clear my problem. I look forward to hear from you,
Thanks, Ankita