Processing GEO data and calculating differential expression using limma
1
1
Entering edit mode
5.2 years ago
APJ ▴ 40

Hello,

I am wondering about the data analysis strategy implemented to calculate the differential expression analysis between wildtype and knockout samples for 2 cell lines. It is a Illumina bead array chip data obtained from [GEO]. I have total of 4 samples for each cell line (2 controls, 2 treated). End output I would like to obtain is foldchange and pvalues; for comparisons done between controls and treated samples from 2 cell lines. Output: B_vs_A_log2FC B_vs_A_pvalue D_vs_C_log2FC D_vs_C_pvalue

I have my code (below), that could perform my task.

But I am wondering about 3 questions:

1) Is it correct to apply neqc function to normalize the expression values of .idat files

2) My design matrix without weights?

3) As both cell lines contains 2 replicates, is this limma calculation makes sense? and the reliability of differential expression values (pvalues and foldchanges)

Any valuable inputs or suggestions?

datadir <- 'GSExxx_RAWdata/' 
    idatfiles <- dir(path=datadir, pattern=".idat")
    bgxfile   <- dir(path=datadir, pattern=".bgx")
    x         <- read.idat(paste(datadir,idatfiles,sep=""), paste(datadir,bgxfile,sep = ""))
    x$other$Detection <- detectionPValues(x)
    **y <- neqc(x)**

## eliminating low detection probes
isexpr <- rowSums(y$other$Detection < 0.05) >= 3
y <- y[isexpr,]
data <- y$E
## add gene symbols
rownames(data) <- y$genes$Symbol
colnames(data) <- c('MD_NC_rep1','MD_NC_rep2',
                    'MD_siPK1_rep1','MD_siPK1_rep2',
                    'HT_NC_rep1','HT_NC_rep2',
                    'HT_siPK1_rep1','HT_siPK1_rep2')
head(data)

i=1; j=i+3; k=j+1; l=k+3
data_MD <- data[,i:j]
data_HT <- data[,k:l]
**design <- model.matrix(~0+factor(c(0,0,1,1)))**
colnames(design) <- c("Wild_type", "Mutant")

# > design
# Wild_type Mutant
# 1         1      0
# 2         1      0
# 3         0      1
# 4         0      1
# attr(,"assign")
# [1] 1 1
# attr(,"contrasts")
# attr(,"contrasts")$`factor(c(0, 0, 1, 1))`
# [1] "contr.treatment"


**cont_matrix <- makeContrasts(MutvsWT = Mutant-Wild_type, levels=design)**
# > cont_matrix
# Contrasts
# Levels      MutvsWT
# Wild_type      -1
# Mutant          1

op.matrix <- data
for (i in 1:2){
  if (i==1){data=data_MD; Nm="MD_siPK1_vs_MD_NC"; tg=targets[c(1:4),]}
  if (i==2){data=data_HT; Nm="HT_siPK1_vs_HT_NC"; tg=targets[c(5:8),]}  

  fit <- lmFit(data, design)  
  fit_contrast <- contrasts.fit(fit, cont_matrix)
  fit_contrast <- eBayes(fit_contrast)
  # Generate a list of top 100 differentially expressed genes
  output <- topTreat(fit_contrast, coef=1, number=Inf, adjust.method="BH")
  p1 <- output[,"P.Value"]
  FC <- output[,"logFC"]
  q1 <- p.adjust(p1, method="fdr");
  p1 <- signif(p1, digits=2);
  q1 <- signif(q1, digits=2);
  FC <- round(FC, digits=2) ;
  P.matrix <- cbind.data.frame(op.matrix, FC, p1, q1);
  colnames(op.matrix)[(ncol(op.matrix)-2):ncol(op.matrix)] <- c(
    paste(Nm, "-log2FC", sep=""),
    paste(Nm, "-p", sep=""),
    paste(Nm, "-FDR", sep=""));
}

op.matrix <- cbind.data.frame('Gene'=rownames(op.matrix),op.matrix)
R limma differential expression beadarray GEO • 1.5k views
ADD COMMENT
1
Entering edit mode
5.2 years ago

1) Is it correct to apply neqc function to normalize the expression values of .idat files

Yes, neqc() is fine for this data - it performs background correction via normexp, and then quantile normalises.

2) My design matrix without weights?

Only use weights in the formula if you have justification; otherwise, do not use weights for no good reason.

3) As both cell lines contains 2 replicates, is this limma calculation makes sense? and the reliability of differential expression values (pvalues and foldchanges)

As you know, 2 replicates is not great. Look at it this way: you would really struggle to re-publish this data by just re-processing this single dataset.

ADD COMMENT
0
Entering edit mode

@Kevin: Thank you for your reply.

By applying neqc normalization, is it the true differential expression is also normalized.

The boxplot , after neqc normalization looks like this https://www.dropbox.com/s/zl57ide3v1nilv8/GSE84962_data_boxplot_neqc.pdf?dl=0

May I know is it good to apply the neqc for 2 different cell lines seperately?

As you know, 2 replicates is not great. Look at it this way: you would really struggle to re-publish this data by just re-processing this single dataset.

Yes you are correct in terms of reproducibility. My top list of differential expression genes is quite different from the genes reported in the published article. So is the reason for this post, if I am doing something wrong with the DE calculations.

ADD REPLY
0
Entering edit mode

Hey.

By applying neqc normalization, is it the true differential expression is also normalized.

The differential expression analysis will be performed on the normalised data. Basically, with the pipeline that you are using, the raw data that is returned by read.idat() will be an EListRaw object. When you run neqc(), it will become an EList object, in which the expression values will be normalised.

When you use limma on this data (the output of neqc()), it will automatically use the normalised expression values.

May I know is it good to apply the neqc for 2 different cell lines seperately?

No, you should normalise everything together, i.e., at the same time. This allows for cross-sample comparisons to take place by dealing with (adjusting for) cross-sample biases.

Yes you are correct in terms of reproducibility. My top list of differential expression genes is quite different from the genes reported in the published article. So is the reason for this post, if I am doing something wrong with the DE calculations.

I see. There are probably many reasons for this, including new updated analysis methods and different filter criteria.

ADD REPLY

Login before adding your answer.

Traffic: 2139 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6