Question

How to integrate multiple data sets from microarray platforms prior to meta-analysis?

9

7.1 years ago

rice.researcher ▴ 220

Efficient methods for comparing data sets generated from different microarray experiments (meta-analysis) have been asked about previously. Based on that discussion, I found the RankProd R package. However, after using it, I wanted to compare the results with another method, since RankProd was developed years ago. I would like to know of any alternative methods for integrating multiple data sets (RNA-seq not included).

Out of 10 data sets, 8 are from Affymetrix and 2 are Agilent. Expression profiles are based on different developmental stages of an organ in Arabidopsis.

microarraay Microaray R Meta-analysis • 8.9k views

ADD COMMENT • link updated 7.1 years ago by svlachavas ▴ 790 • written 7.1 years ago by rice.researcher ▴ 220

2

7.1 years ago

svlachavas ▴ 790

Just to add some crucial comments to further extend Kevin's comprehensive answer:

1) Firstly, regarding the two different microarray platforms used: are the arrays the same within each platform? For instance, for Affymetrix, you have mentioned from the link above that it is Affymetrix hgu133a, but are there also other, different "sub-platforms"? Are the experimental conditions also similar, or are there evident variations in the experimental design concerning the developmental stages in Arabidopsis that could lead to a clear batch effect? In other words, when combining any of the datasets (even those from the same Affymetrix platform), you would generally have to construct a rather complicated design to account for experiment/study-specific effects (as well as other potential problems with normalization, variance estimation, etc.).

2) In my opinion, a first "basic and powerful" approach (again, if you have similar experimental designs and biological questions) would be to perform the DE analysis for each dataset separately. Then:

A) I would initially compare the DE probes (or, more appropriately, annotate them to gene symbols) to find any "common genuine DE genes" that are identified consistently across the different datasets or experiments, as well as possible differences.

B) In parallel, you could perform a kind of "functional-enrichment meta-analysis": again, for each of your separate DE lists, conduct a GO/KEGG analysis and look for common biological pathways or processes that appear in the different datasets.
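The comparison of DE lists described above could be sketched in R roughly as follows. This is only an illustration: the dataset names and gene identifiers are made up, and in practice each list would come from a separate DE analysis.

```r
# Hypothetical DE gene lists, one per dataset (identifiers are invented for illustration)
de_lists <- list(
  Affy1    = c("GeneA", "GeneB", "GeneC", "GeneD"),
  Affy2    = c("GeneB", "GeneC", "GeneE"),
  Agilent1 = c("GeneB", "GeneC", "GeneD", "GeneF")
)

# Genes called DE in every dataset
common_genes <- Reduce(intersect, de_lists)

# Number of datasets in which each gene is called DE
de_counts <- table(unlist(lapply(de_lists, unique)))

# Genes called DE in at least two datasets (the threshold is an arbitrary choice)
recurrent_genes <- names(de_counts)[de_counts >= 2]

print(common_genes)
print(sort(de_counts, decreasing = TRUE))
```

Genes that appear in only one list are the "possible differences" and may be worth checking against the design of that particular study.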

Finally, if you would like to try the merging approach, you could follow the instructions above and perhaps perform a batch-effect correction with ComBat from the R package sva, using the experimental study as the known batch covariate.

(*Regarding RankProd, it is another possibility, but I would instead suggest the R package RankAggreg, which seems more appropriate for your approach: you would analyze each dataset separately, keep the top-k ranked genes by some criterion, and then perform a similar analysis to keep the most "informative genes".)

Hope that helps,

Efstathios

ADD COMMENT • link 7.1 years ago by svlachavas ▴ 790

1

Good answer Efstathios!

ADD REPLY • link 7.1 years ago by Kevin Blighe 88k

0

Thanks for your views and suggestions. The Affymetrix arrays are from different platforms with different experimental designs. However, I am not interested in DE or enrichment analysis, only in data integration. Yes, those R packages are new to me, and I will use them for comparison.

ADD REPLY • link 7.0 years ago by rice.researcher ▴ 220

score 15 · Accepted Answer · 2017-11-13

15

7.1 years ago

Kevin Blighe 88k

Updated February 29, 2020

Updated October 9, 2018

-----------------------------------------

If the array versions are the same

If the Affymetrix arrays are all of the same version/type, then just process them all together at the same time. Same for the Agilent arrays. Afterwards, use one as the training dataset and the other as the validation dataset, i.e., obtain results from one, and then corroborate these [results] in the other.

You could attempt to perform statistical analyses on them combined, too (i.e. not as training and validation), but, in this case, include ArrayVersion (i.e. batch) as a covariate in downstream analyses - there will likely be some batch effect.

If all data-set arrays are different

Process each independently
Summarise expression values across whole genes independently
Create a list of common features/genes across all arrays
Convert the data to Z-scores independently
Merge (bind) the data together
Build regression models predicting for your outcome and include 'ArrayVersion' as a covariate in the model
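As a rough sketch of the steps above (finding common genes, converting to Z-scores, and merging), something like the following could work in R. The data are simulated and all object names are hypothetical. It also assumes that "Z-scores" means scaling each gene across the samples within its own dataset; that is one possible reading of the step, not the only one.

```r
set.seed(1)

# Two hypothetical gene-level expression matrices, already normalised separately
# (genes on rows, samples on columns); names and values are simulated
affy1 <- matrix(rnorm(5 * 4, mean = 8, sd = 1), nrow = 5,
                dimnames = list(paste0("Gene", 1:5), paste0("A1_S", 1:4)))
agilent1 <- matrix(rnorm(4 * 3, mean = 0, sd = 2), nrow = 4,
                   dimnames = list(paste0("Gene", 2:5), paste0("G1_S", 1:3)))
datasets <- list(Affy1 = affy1, Agilent1 = agilent1)

# List of features/genes common to all arrays
common <- Reduce(intersect, lapply(datasets, rownames))

# Z-scores computed within each dataset independently (per gene, across samples)
zscored <- lapply(datasets, function(x) t(scale(t(x[common, , drop = FALSE]))))

# Merge (bind) the data together, then put samples on rows
merged <- t(do.call(cbind, zscored))

# Record which array each sample came from, for use as a model covariate
ArrayVersion <- factor(rep(names(datasets), times = sapply(datasets, ncol)))
MyData <- data.frame(ArrayVersion, merged)
```

MyData then has one row per sample, an ArrayVersion column, and one column per common gene, which is the layout a per-gene regression model would need once the outcome column is added.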

My advice is to avoid the use of programs that aim to directly correct for batch. It is preferable to include batch as a covariate in your statistical models.

---------------

NB - a word of caution about these approaches: they are not 'fool proof'. If you are trying to analyse, e.g., brain tissue and skin tissue together, then the expression profiles will be so different that you could not analyse them together in any way. Proceed with caution with the approaches that I mention, and take time to inspect the output via histograms and box-and-whisker plots.
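The kind of check mentioned in the note above might look something like this in R. The matrix here is simulated, the sample names are hypothetical, and deriving the batch label from the sample name is just a convenience for the example.

```r
set.seed(7)

# Simulated merged matrix: genes on rows, samples on columns (values are hypothetical)
expr <- cbind(matrix(rnorm(200 * 3), ncol = 3),
              matrix(rnorm(200 * 3, mean = 0.5), ncol = 3))
colnames(expr) <- c(paste0("Affy1_S", 1:3), paste0("Agilent1_S", 1:3))

# Batch label taken from the sample name prefix (an assumption of this example)
batch <- factor(sub("_S[0-9]+$", "", colnames(expr)))

# Per-sample medians and interquartile ranges; large gaps between batches are a warning sign
sample_summary <- data.frame(batch,
                             median = apply(expr, 2, median),
                             IQR = apply(expr, 2, IQR))
print(sample_summary)

# One box per sample, coloured by batch, plus a histogram of all values
boxplot(expr, col = as.integer(batch) + 1, las = 2, ylab = "Expression")
hist(expr, breaks = 50, main = "All samples", xlab = "Expression")
```

If the boxes for one batch sit clearly apart from the others after merging, that is a sign the datasets are not yet comparable.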

ADD COMMENT • link 4.8 years ago by Kevin Blighe 88k

0

Thanks for your suggestion! My Affymetrix data sets are from different versions, so I tried the second option. I have already normalized all data sets separately and obtained the intensity values. The data sets were merged based on common probes. However, I could not understand the last point, as my statistical experience is limited. Could you explain it in more detail?

ADD REPLY • link 7.0 years ago by rice.researcher ▴ 220

0

Hi rice.researcher, on the last point, you could just build a logistic regression model for each gene independently, whilst adjusting for 'batch' / 'array experiment':

In R Programming Language, it would be:

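#Factorise your outcome variable and set 'StageI' as the reference level
MyData$DevelopmentalStage <- factor(MyData$DevelopmentalStage, levels=c("StageI", "StageII", ..., "StageX"))

#Build the model
#NB: with family=binomial and a factor outcome, glm() treats the first level ('StageI') as 'failure' and all other levels as 'success'
model <- glm(DevelopmentalStage ~ Gene1 + ArrayExperiment, data=MyData, family=binomial(link="logit"))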

#Get P value and estimates (beta coefficients)
summary(model)

MyData will be a data-frame that contains 1 column for DevelopmentalStage, another for ArrayExperiment (e.g. Affy1, Affy2, ..., Affy8, Agilent1, Agilent2), and then gene expression values. Samples are on rows, and obviously sample names for DevelopmentalStage and ArrayExperiment have to match those of your genes.

The model is essentially testing the hypothesis that the gene's expression is associated with the different developmental stages, but the stats will be adjusted based on ArrayExperiment, i.e., ArrayExperiment is a covariate for which we must adjust. They do similar things in large clinical trials, like adjusting for BMI, smoking status, race/ethnicity, household income, etc.
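One caveat when there are more than two developmental stages: glm() with family=binomial treats the first factor level as 'failure' and every other level as 'success' (see ?family), so it effectively compares StageI against all later stages combined. If a separate comparison of each stage against StageI is wanted, one option is a multinomial model. Below is a minimal sketch using multinom() from the nnet package (a 'recommended' package that ships with standard R installations). The data are simulated, the column names simply mirror those used above, and the Wald P values are computed by hand from the coefficients and standard errors.

```r
library(nnet)

set.seed(42)
n <- 90

# Simulated data: 3 stages crossed with 3 array experiments (all values hypothetical)
MyData <- data.frame(
  DevelopmentalStage = factor(rep(c("StageI", "StageII", "StageIII"), each = 30),
                              levels = c("StageI", "StageII", "StageIII")),
  ArrayExperiment = factor(rep(c("Affy1", "Affy2", "Agilent1"), times = 30))
)

# Expression with a mild stage effect and a batch shift
MyData$Gene1 <- rnorm(n) + 0.8 * as.integer(MyData$DevelopmentalStage) +
  0.5 * as.integer(MyData$ArrayExperiment)

# Multinomial model: each non-reference stage is compared against StageI
fit <- multinom(DevelopmentalStage ~ Gene1 + ArrayExperiment, data = MyData, trace = FALSE)

# Wald Z statistics and two-sided P values, one row per non-reference stage
s <- summary(fit)
z <- s$coefficients / s$standard.errors
p <- 2 * pnorm(-abs(z))

# P values for the gene, adjusted for ArrayExperiment
print(p[, "Gene1"])
```

Each row of the output corresponds to one stage versus StageI, so a gene gets one P value per non-reference stage rather than a single one.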

The only issue is that you'll have to run this over the entire dataset for each gene, and then output the stats values. I have posted some code here (scroll down a bit) on how to do this: R functions edited for parallel processing (parallelised for multi-core processing)

However, here is some simpler code (not parallelised): Assuming that your data is in a data-frame called modelling, with the first two columns being DevelopmentalStage and ArrayExperiment:

#Write the header line of the results file
write.table("Gene\tBeta\tStandard.Error\tZ.Score\tOR\tp.value", "Results.tsv", sep="\t", quote=FALSE, col.names=FALSE, row.names=FALSE, append=FALSE)

j <- 0
for (i in 3:ncol(modelling))
{
    #Fit one model per gene, adjusting for ArrayExperiment
    gene <- colnames(modelling)[i]
    formula <- as.formula(paste("DevelopmentalStage ~ ArrayExperiment + ", gene, sep=""))
    model <- glm(formula, data=modelling, family=binomial(link="logit"))

    #Extract the gene's row from the coefficient table by name
    #(columns are: estimate, standard error, Z value, P value)
    stats <- coef(summary(model))[gene,]
    Beta <- stats[[1]]
    stderror <- stats[[2]]
    Z <- stats[[3]]
    OR <- exp(Beta)
    p <- stats[[4]]

    #Append the results for this gene to the output file
    wObject <- data.frame(gene, Beta, stderror, Z, OR, p)
    write.table(wObject, "Results.tsv", sep="\t", quote=FALSE, col.names=FALSE, row.names=FALSE, append=TRUE)
    j <- j + 1

    if (j %% 500 == 0)
    {
        print(paste(j, " transcripts processed", sep=""))
    }
}
print(paste("Total models: ", j, sep=""))

The gene's row in the results produced by summary() is looked up by name here, because its row number depends on how many levels ArrayExperiment has. This assumes that the gene column names are syntactically valid R names.

Failing this, you can try to use limma and aim to include ArrayExperiment as a covariate that way.

ADD REPLY • link 7.0 years ago by Kevin Blighe 88k

0

Very well explained! I am looking into it.

ADD REPLY • link 7.0 years ago by rice.researcher ▴ 220

0

Hi. Since I am facing some issues while merging datasets from different versions of Affymetrix arrays, can you please tell me how you merged those datasets? Did you use any specific R package or tool?

ADD REPLY • link 5.1 years ago by ghataksoumyakanti • 0