Next-generation sequencing (NGS) has been developed, resulting in the generation of numerous studies. Each study typically involves small sample sizes, prompting many bioinformatic researchers (or other scientists) to combine similar (or related) samples to increase the overall sample size. This approach offers the benefit of obtaining a sample group closer to the population, thereby enabling more statistically significant results.
However, a challenge arises due to batch effects, such as condition variations, read counts, platforms, etc. If batch effects between studies cannot be removed, it can lead to a failure to obtain statistically meaningful results. Due to these reasons, many tools have been developed to assist in removing batch effects.
I understand that meta-analysis is a process in which similar research problems are appropriately combined statistically by removing batch effects, to obtain more meaningful results from existing (known) results. I know that such meta-analysis can be applied to RNA-seq, microarray, and even single-cells. However, I have only conducted meta-analysis separately for RNA-seq and microarray data, and have not attempted to combine them in a single meta-analysis.
In microarray, We can download raw data in .CEL format from DB (GEO, ArrayExpress, etc.) and obtain the expression matrix using the code below (using *oligo* package).
# Example code
# E-MEXP- 2681 annotation
m_2681_anno <- read.delim('E-MEXP-2681/E-MEXP-2681.sdrf.txt', header = T, stringsAsFactors = F)
m_2681_anno <- m_2681_anno[,c('Source.Name', 'Characteristics.DiseaseState.')]
m_2681_anno <- m_2681_anno[m_2681_anno$Characteristics.DiseaseState.!='polymyositis',]
rownames(m_2681_anno) <- NULL
colnames(m_2681_anno) <- c('Sample', 'State')
m_2681_anno$State <- ifelse(m_2681_anno$State == 'normal', 'Normal', 'Disease')
m_2681_anno$Sample <- paste0(m_2681_anno$Sample, '.CEL')
# E-MEXP-2681 expression # Affymetrix GeneChip Human Genome HG-U133A [HG-U133A]
m_2681_cel <- read.celfiles(paste0('E-MEXP-2681/', m_2681_anno$Sample), pkgname = 'pd.hg.u133a')
m_2681_rma <- oligo::rma(m_2681_cel)
m_2681_exp <- exprs(m_2681_rma)
# Affymetrix GeneChip Human Genome HG-U133A [HG-U133A] -> ensembl
u133a <- getBM(attributes = c('affy_hg_u133a_2', 'ensembl_gene_id'),
filters = 'affy_hg_u133a_2',
values = rownames(m_2681_exp),
mart = ensembl_hs)
In the case of microarray, when it uses a different platform, the probe ID can also be different. So we need to check the data platform probe IDs. In my case, when unifying multiple studies, I use a common probe ID to create a single expression matrix for analysis. Utilizing only genes common to each study may lead to loss of data, but this aspect seems to require further discussion. After, when creating 'merge_exp' and 'merge_anno', which represent the expression matrix and annotation, use the following code to analyze differential expression (using *sva and limma* packages). sva::Combat() can help to remove batch effect while limma works for differential expression analysis using a linear model.
# Example code
dim(merge_exp)
# 22277 63
dim(merge_anno)
# 63 2
colnames(merge_anno)
# 'State' 'Batch'
all(rownames(merge_anno) == colnames(merge_exp))
# TRUE
head(merge_anno$State)
# Normal Disease Disease Normal Normal Normal
# Levels: Disease Normal
head(merge_anno$Batch)
# 1 1 1 1 1 1
# Levels: 1 2
model_mat <- model.matrix(object = ~1, data = merge_anno)
combat_merge_exp <- sva::Combat(dat = merge_exp, batch = merge_anno$Batch, mod = model_mat, par.prior = T)
# Differential expression analysis (Use limma)
design <- model.mat(object = ~0+State, data = merge_anno)
colnames(design) <- c('Disease', 'Normal')
log_combat_merge_exp <- log2(combat_merge_exp + 0.1)
fit <- limma::lmFit(object = log_combat_merge_exp, design = design)
contrasts <- limma::makeContrasts('DvsC' = Disease - Normal, levels = design)
contrast.fit <- limma::contrasts.fit(fit = fit, contrasts = contrasts)
eBayes <- limma::eBayes(contrast.fit, trend = T, robust = T)
result <- limma::topTable(eBayes, coef = 'DvsC', n = Inf, adjust.method = 'BH')
The raw data of RNA-seq can be downloaded through the Sequence Read Archive (SRA), or sometimes raw counts are uploaded to GEO or ArrayExpress. However, some datasets provide normalized matrices instead of raw count matrices, which can complicate analysis. Therefore, it is preferable to proceed with raw SRA data use, and there are many tools available to assist in this process. Assuming we have obtained raw counts of data, we can use DESeq2, which required raw counts for analysis. In performing meta-analysis of RNA-seq data, it is also necessary to address batch effects, which can be achieved using sva::Combat_seq(). Below is the code demonstrating batch effect removal and differential expression analysis:
# Example code
dim(merge_exp)
# 43277 45
dim(merge_anno)
# 45 2
colnames(merge_anno)
# 'State' 'Batch'
head(merge_anno$State)
# Normal Disease Disease Normal Normal Normal
# Levels: Disease Normal
head(merge_anno$Batch)
# 1 1 1 1 1 1
# Levels: 1 2 3
all(rownames(merge_anno) == colnames(merge_exp))
# TRUE
# Remove batch effects using Combat_seq
combat_merge_exp <- sva::Combat_seq(merge_exp, batch = merge_anno$Batch)
ddsobj <- DESeqDataSetFromMatrix(countData = round(combat_merge_exp),
colData = merge_anno,
design = ~State)
filt <- min(matrixStats::count(merge_anno$State == 'Disease', matrixStats::count(merge_anno$State == 'Normal')))
keep <- rowSums(counts(ddsobj) >= 10) >= filt
ddsobj <- ddsobj[keep,]
# Differential expression analysis (Use DESeq2)
res <- DESeq(ddsobj)
res <- results(res, contrast = c('State', 'Disease', 'Normal')) %>% data.frame()
I've just described the process of conducting meta-analysis for each dataset. Now, I'm seeking advice or answers to questions regarding my research methods. Are there any experienced researchers who could provide guidance?
I'm curious if there are any issues or parts that need to be added to the code for the methods I've used in my analysis (code).
I've conducted differential expression analysis in meta-analysis using limma for microarray data. I'm curious if geneMeta would be more effective.
If the data I intend to use are uploaded as raw counts, can I directly apply Combat_seq() to them? Naturally, it would be more accurate to proceed from SRA data, but is this approach acceptable?
The most important part of this post. How to approach conducting meta-analysis simultaneously for microarray and RNA-seq data? Given that the two expression matrices have different formats, it seems challenging to use DESeq. I'm curious about how to unify them and create merged data that has had batch effects removed.
I have learned that normalization and batch correction are different. Normalization typically refers to removal of technical biases between samples, while batch correction involves removal of both technical biases and biological differences between batches. Could you provide a more accurate definition than this one?
When we need to map probe IDs to specific genes, there are various options such as converting them to Gene symbols, Ensembl IDs, entrez IDs, etc. What is the most appropriate approach during the analysis process? Is it best to perform this conversion before applying it to the model, after applying it to all, or is it acceptable to apply it differently depending on the requirements of the process?
I hope this question will be helpful to many researchers, and I would greatly appreciate it if you could provide an answer that could benefit me and numerous others.