Hello,
I am wondering about the data analysis strategy for calculating differential expression between wildtype and knockout samples in 2 cell lines. The data are Illumina bead array (.idat) files obtained from [GEO]. I have a total of 4 samples per cell line (2 controls, 2 treated). The end output I would like is fold changes and p-values for the control-vs-treated comparison within each cell line, i.e.: B_vs_A_log2FC B_vs_A_pvalue D_vs_C_log2FC D_vs_C_pvalue
I have code (below) that performs this task, but I have 3 questions:
1) Is it correct to apply the neqc function to normalize the expression values from the .idat files?
2) Is my design matrix OK without array weights?
3) As each cell line contains only 2 replicates, does this limma calculation make sense, and how reliable are the resulting differential expression values (p-values and fold changes)?
Any input or suggestions would be appreciated.
datadir <- 'GSExxx_RAWdata/'
idatfiles <- dir(path=datadir, pattern=".idat")
bgxfile <- dir(path=datadir, pattern=".bgx")
x <- read.idat(paste0(datadir, idatfiles), paste0(datadir, bgxfile))
x$other$Detection <- detectionPValues(x)
y <- neqc(x)
## keep probes detected (detection p < 0.05) in at least 3 arrays
isexpr <- rowSums(y$other$Detection < 0.05) >= 3
y <- y[isexpr,]
data <- y$E
## add gene symbols
rownames(data) <- y$genes$Symbol
colnames(data) <- c('MD_NC_rep1','MD_NC_rep2',
'MD_siPK1_rep1','MD_siPK1_rep2',
'HT_NC_rep1','HT_NC_rep2',
'HT_siPK1_rep1','HT_siPK1_rep2')
head(data)
data_MD <- data[, 1:4]   # MD cell line: 2 NC + 2 siPK1
data_HT <- data[, 5:8]   # HT cell line: 2 NC + 2 siPK1
design <- model.matrix(~0 + factor(c(0, 0, 1, 1)))
colnames(design) <- c("Wild_type", "Mutant")
# > design
# Wild_type Mutant
# 1 1 0
# 2 1 0
# 3 0 1
# 4 0 1
# attr(,"assign")
# [1] 1 1
# attr(,"contrasts")
# attr(,"contrasts")$`factor(c(0, 0, 1, 1))`
# [1] "contr.treatment"
cont_matrix <- makeContrasts(MutvsWT = Mutant - Wild_type, levels=design)
# > cont_matrix
# Contrasts
# Levels MutvsWT
# Wild_type -1
# Mutant 1
op.matrix <- data
for (i in 1:2){
  if (i==1){dat <- data_MD; Nm <- "MD_siPK1_vs_MD_NC"}
  if (i==2){dat <- data_HT; Nm <- "HT_siPK1_vs_HT_NC"}
  fit <- lmFit(dat, design)
  fit_contrast <- contrasts.fit(fit, cont_matrix)
  fit_contrast <- eBayes(fit_contrast)
  ## Extract all genes; sort.by="none" keeps the original probe order
  ## so the columns line up with op.matrix when we cbind below
  output <- topTable(fit_contrast, coef=1, number=Inf, adjust.method="BH", sort.by="none")
  p1 <- signif(output[,"P.Value"], digits=2)
  q1 <- signif(output[,"adj.P.Val"], digits=2)  # already BH-adjusted; no need to re-run p.adjust
  FC <- round(output[,"logFC"], digits=2)
  op.matrix <- cbind.data.frame(op.matrix, FC, p1, q1)
  colnames(op.matrix)[(ncol(op.matrix)-2):ncol(op.matrix)] <- c(
    paste0(Nm, "-log2FC"),
    paste0(Nm, "-p"),
    paste0(Nm, "-FDR"))
}
op.matrix <- cbind.data.frame('Gene'=rownames(op.matrix),op.matrix)
@Kevin: Thank you for your reply.
After applying neqc normalisation, is the differential expression then calculated on the normalised values?
The boxplot after neqc normalisation looks like this: https://www.dropbox.com/s/zl57ide3v1nilv8/GSE84962_data_boxplot_neqc.pdf?dl=0
May I also know whether it is good to apply neqc to the 2 cell lines separately?
As you know, 2 replicates is not great. Look at it this way: you would really struggle to re-publish this data by just re-processing this single dataset.
Yes, you are correct in terms of reproducibility. My top list of differentially expressed genes is quite different from the genes reported in the published article, which is the reason for this post: I want to check whether I am doing something wrong in the DE calculations.
Hey.
The differential expression analysis will be performed on the normalised data. Basically, with the pipeline that you are using, the raw data returned by read.idat() will be an EListRaw object. When you run neqc(), it becomes an EList object in which the expression values are normalised. When you then use limma on this data (the output of neqc()), it will automatically use the normalised expression values.

No, you should normalise everything together, i.e., at the same time. This allows cross-sample comparisons to take place by dealing with (adjusting for) cross-sample biases.
I see. There are probably many reasons for this, including newer, updated analysis methods and different filter criteria.
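On your question about weights: a hedged sketch (not part of your original code) of limma's arrayWeights(), which estimates a relative quality weight per array and can help when replicates are few. It assumes a normalised EList y, a design matrix, and a contrast matrix cont_matrix as in your pipeline:

```r
library(limma)

## Estimate per-array quality weights; arrays that fit the linear model
## poorly are down-weighted in lmFit rather than discarded.
aw <- arrayWeights(y, design)
fitw <- lmFit(y, design, weights = aw)
fitw <- eBayes(contrasts.fit(fitw, cont_matrix))
```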