Question

Analysis of GEO2R

0

Entering edit mode

7.0 years ago

mannoulag1 ▴ 130

Hi, Can I continue the analysis after obtening the toptable of differentially expressed genes using the R code of GEO2R or I should do my own analysis? I have many samples in my dataset, how can I define the groups in GEO2R because it influence the result of the toptable. The normalization in GEO2R is the Log transformation ?? Thank you in advance

R gene geo2R • 5.0k views

ADD COMMENT • link updated 7.0 years ago by Kevin Blighe 88k • written 7.0 years ago by mannoulag1 ▴ 130

score 3 · Answer 1 · 2017-12-29

3

Entering edit mode

7.0 years ago

Kevin Blighe 88k

GEO2R is just an interface between R and the Gene Expression Omnibus (GEO) database. It usually allows you to directly import data that is already normalised. The normalisation type, if Affymetrix microarray data, will most likely consist of a 3-step process known as RMA (Robust Multiarray Average) normalisation:

background correction
quantile normalisation
log base 2 transformation

When you import this data into R, you can do anything that you want with it. Usually people fit a linear model with limma and then extract the statistical representations of this model via the toptable() function.

If you wish to merge datasets ('groups'?) together, then they should ideally be from the same microarray version.

Further questions, ask away.

Kevin

ADD COMMENT • link 5.4 years ago by Kevin Blighe 88k

0

Entering edit mode

Hi Kevin and Happy new year 2018 , Thank you for your answer ,

ok, for the normalisation I already did it with the function rma in my R code, this give the same result? and also I tested the gcrma normalisation with the function gcrma, what is the best method for normalisation? for defining groups: I means that in GEO2R we can group samples of a GSE dataset to do differential expression analysis, how can I do this? I should group all the samples in groups or some samples?

ADD REPLY • link 7.0 years ago by mannoulag1 ▴ 130

1

Entering edit mode

Hi! Happy New Year!

Yes, RMA and gcRMA will both perform the background correction, quantile normalisation, and log transformation automatically for you. gcRMA is slightly different in that it takes into account biases related to the DNA bases G and C, which can, for example, affect the annealing temperature of a probe to the template DNA. It is better to use gcRMA, as it will result in less criticism of your work.

For the other question, you should only combine samples if they are from the same microarray, for example, Affymetrix HuGene, etc.

How many GEO datasets are you analysing?; are they all based on the same microarray?

In order to conduct the differential expression analysis, limma is used. It involves the construction of a 'design matrix' based on your groups of interest and then fitting a linear model to your data based on this model. Then, statistical values can be obtained from the linear fit.

There are many, many limma tutorials online. I can help here but I just need to understand better your experimental set-up.

ADD REPLY • link 7.0 years ago by Kevin Blighe 88k

0

Entering edit mode

thank you, I am analysing this dataset from the Affymetrix ATH1 microarray: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE5632

ADD REPLY • link 7.0 years ago by mannoulag1 ▴ 130

2

Entering edit mode

Thanks, coincidentally, that group is based very near where I did my PhD (in the 'heart' of England).

Why have you chosen this particular experiment? - i.e., in which facet of Arabidopsis spp. are you interested? According to the authors, the study relates to the AtGenExpress project and their samples consist of data from "different tissues and different developmental stages in wild type Columbia (Col-0) and various mutants".

I was able to find metadata in the Series Matrix File. So, if I wanted to compare the various mutants to wild-types, I would first have to produce my own metadata file:

metadata <- read.table("Metadata.tsv", header=TRUE, sep="\t", stringsAsFactors=FALSE)
head(metadata)
   SampleID1 SampleID2  Type                 FlowerStage
ATGE_31_A2 GSM131576 Col-0     Tissue: flowers stage 9
ATGE_31_B2 GSM131577 Col-0     Tissue: flowers stage 9
ATGE_31_C2 GSM131578 Col-0     Tissue: flowers stage 9
ATGE_32_A2 GSM131579 Col-0 Tissue: flowers stage 10/11
ATGE_32_B2 GSM131580 Col-0 Tissue: flowers stage 10/11
ATGE_32_C2 GSM131581 Col-0 Tissue: flowers stage 10/11

unique(metadata$Type)
[1] "Col-0"  "clv3-7" "lfy-12" "ap1-15" "ap2-6"  "ap3-6"  "ag-12"  "ufo-1"

Limma does not like hyphens, so, we'll have to erase these:

metadata$Type <- gsub("-", "", metadata$Type)

I also should have the expression data. In your case, this would have been produced by rma / gcrma. In you case, you may have to access the expression counts by using the expr() function on the object produced by your rma / gcrma function.

expression[1:5,1:4]
          GSM131576  GSM131577  GSM131578 GSM131579
244901_at  24.02617  26.973778  27.886177  24.01950
244902_at  24.58554  31.889400  32.521889  29.33675
244903_at 331.31256 333.799683 313.053986 289.04169
244904_at  46.36324  53.558990  56.662281  41.90237
244905_at   7.35668   4.833285   6.770735  13.32703

It's then key to match the order of samples in your expression matrix with the order of vaues in the metadata. In my example, they match:

colnames(expression) == metadata$SampleID2
 [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[16] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[31] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[46] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
[61] TRUE TRUE TRUE TRUE TRUE TRUE

If they do not match, you will have to re-order one or the other using match() or some other function. The remainder is then easy:

#Create the study design and comparison model
design <- paste(metadata$Type, sep="")
design <- factor(design)
comparisonmodel <- model.matrix(~0+design)
colnames(comparisonmodel) <- levels(design)

#Checking the experimental design
design
comparisonmodel

#Fit the linear model
project.fit <- lmFit(expression, comparisonmodel)

#Applying the empirical Bayes method
project.fit.eBayes <- eBayes(project.fit)
names(project.fit.eBayes)

#Compare the ap1-15 mutant to the wild type
ap1 <- makeContrasts(ap1="ap115-Col0", levels=comparisonmodel)
ap1.fitmodel <- contrasts.fit(project.fit.eBayes, ap1)
ap1.fitmodel.eBayes <- eBayes(ap1.fitmodel)

#Display top 15 statistically significant results
topTable(ap1.fitmodel.eBayes, coef="ap1", adjust.method="BH", number=15)
              logFC    AveExpr         t      P.Value    adj.P.Val        B
257529_at 139.33751   7.490600 93.657309 1.423240e-65 3.246411e-61 9.337236
247530_at 143.95661  27.784289 42.271545 1.248731e-45 1.424178e-41 8.881616
264216_at  43.49803   3.706805 25.810794 1.023969e-33 7.785581e-30 8.010667
258818_at  90.30845  45.242973 16.280347 2.003156e-23 1.142300e-19 6.308462
259309_at  66.50663  44.491757 15.954609 5.287536e-23 2.412174e-19 6.208402
258848_at 705.69414 180.011325 15.519568 1.971541e-22 7.495142e-19 6.067947
259047_at 303.34628 300.413848 14.796651 1.847857e-21 6.021373e-18 5.815963
258595_at  83.76634  82.451594 14.414840 6.184526e-21 1.763363e-17 5.672762
258863_at 152.21179  71.865322 12.996737 6.458439e-19 1.636855e-15 5.071508
259050_at  89.50999  59.268068 11.867062 3.150331e-17 7.185904e-14 4.502501
257518_at  64.90205  51.490512 11.781766 4.252902e-17 8.818973e-14 4.455890
259051_at 206.36788 155.576320 11.326456 2.143141e-16 4.073755e-13 4.197786
259346_at 387.47796  82.546384 10.566023 3.376299e-15 5.924106e-12 3.729969
258801_at 114.31843 100.823329 10.329159 8.079099e-15 1.316316e-11 3.574355
258844_at 142.77408 131.234125 10.295457 9.151600e-15 1.391653e-11 3.551819

ADD REPLY • link 7.0 years ago by Kevin Blighe 88k