Tutorial:Analysing Microarray Data In Bioconductor
0
157
Entering edit mode
12.3 years ago

I was thinking about creating a tutorial on how to do a simple microarray analysis in Bioconductor. But, I realized this has already been done quite nicely at the Bioinformatics Knowledgeblog. Their first tutorial on the subject covers installation of necessary packages, downloading of cel files, describing the experiment, loading and normalizing data, quality controls, probe set filtering, finding differentially expressed probe sets, and finally annotating those probe sets to gene symbols. They have follow-up tutorials on visualizing microarray data with volcano plots and assessing the biological significance of results. All three seem quite excellent after a brief perusal.

I might have also included: filtering with 'genefilter', multiple testing correction with 'multtest', and visualizations with 'heatmap.2'. Let me know if you would like to see tutorials specifically on any of those topics.

UPDATE: Since this seems to be coming up with some regularity I decided to add some sample code for importing CEL files, normalizing, and annotating to Entrez Gene ID and Symbol, even though this overlaps largely with the tutorial linked above. It illustrates a few slightly different things and also gives a concrete example for the 'Affymetrix Human Gene 1.0 ST Array'.

#install the core bioconductor packages, if not already installed
source("http://bioconductor.org/biocLite.R")
biocLite()

# install additional bioconductor libraries, if not already installed
biocLite("GEOquery")
biocLite("affy")
biocLite("gcrma")
biocLite("hugene10stv1cdf")
biocLite("hugene10stv1probe")
biocLite("hugene10stprobeset.db")
biocLite("hugene10sttranscriptcluster.db")

#Load the necessary libraries
library(GEOquery)
library(affy)
library(gcrma)
library(hugene10stv1cdf)
library(hugene10stv1probe)
library(hugene10stprobeset.db)
library(hugene10sttranscriptcluster.db)

#Set working directory for download
setwd("/Users/ogriffit/Dropbox/BioStars")

#Download the CEL file package for this dataset (by GSE - Geo series id)
getGEOSuppFiles("GSE27447")

#Unpack the CEL files
setwd("/Users/ogriffit/Dropbox/BioStars/GSE27447")
untar("GSE27447_RAW.tar", exdir="data")
cels = list.files("data/", pattern = "CEL")
sapply(paste("data", cels, sep="/"), gunzip)
cels = list.files("data/", pattern = "CEL")

setwd("/Users/ogriffit/Dropbox/BioStars/GSE27447/data")
raw.data=ReadAffy(verbose=TRUE, filenames=cels, cdfname="hugene10stv1") #From bioconductor

#perform RMA normalization (I would normally use GCRMA but it did not work with this chip)
data.rma.norm=rma(raw.data)

#Get the important stuff out of the data - the expression estimates for each array
rma=exprs(data.rma.norm)

#Format values to 5 decimal places
rma=format(rma, digits=5)

#Map probe sets to gene symbols or other annotations
#To see all available mappings for this platform
ls("package:hugene10stprobeset.db") #Annotations at the exon probeset level
ls("package:hugene10sttranscriptcluster.db") #Annotations at the transcript-cluster level (more gene-centric view)

#Extract probe ids, entrez symbols, and entrez ids
probes=row.names(rma)
Symbols = unlist(mget(probes, hugene10sttranscriptclusterSYMBOL, ifnotfound=NA))
Entrez_IDs = unlist(mget(probes, hugene10sttranscriptclusterENTREZID, ifnotfound=NA))

#Combine gene annotations with raw data
rma=cbind(probes,Symbols,Entrez_IDs,rma)

#Write RMA-normalized, mapped data to file
write.table(rma, file = "rma.txt", quote = FALSE, sep = "\t", row.names = FALSE, col.names = TRUE)

This produces a tab-delimited text file of the following format. Note that many probes will have "NA" for gene symbol and Entrez ID.

probes    Symbols        Entrez_IDs    GSM678364_B2.CEL    GSM678365_B4.CEL    GSM678366_B5.CEL    ...
7897441    H6PD        9563            6.5943                 7.0552                7.5201    ...
7897449    SPSB1        80176        6.9727                7.0281                7.2285    ...
7897460    SLC25A33    84275        7.6659                7.4289                7.9707    ...
microarray-analysis bioconductor • 74k views
ADD COMMENT
4
Entering edit mode

I wanted to perform WGCNA analysis on Affy cell files. So according to this tutorial, the "rma.txt" could be used as the input for WGCNA ?

ADD REPLY
4
Entering edit mode

I've not used the WGCNA R package. But, if it starts with a gene expression matrix as input, then yes, it seems like you could use the RMA normalized gene expression values for your analysis.

ADD REPLY
3
Entering edit mode

Thanks..yea, the first column has the list of identifiers followed by the expression profiles of all the samples.

ADD REPLY
2
Entering edit mode

Hi, my computer gets stuck every time I use the annotation package (the cbind command). Is there any way to overcome that? I have also tried allotting heavy memory on our computing cluster and yet the problem is unsolved

ADD REPLY
3
Entering edit mode

This so helpful! I've been trying to figure out all these for months.. Thank you!

ADD REPLY
1
Entering edit mode

No problem. Although, thanks should go to the good people at the Bioinformatics Knowledgeblog who created these tutorials. P.S., for future reference, comments/thanks like this would be better as a "comment" to the main post rather than as an "answer".

ADD REPLY
0
Entering edit mode

Dear Obi,

I have normalized my affymetrix microarray data and did gene filter to remove probes that do not have entrez id's.

Now I want to annotate my data for gene symbols. I've followed your steps to retrieve gene symbols. After writing the data in text file I found a column mentioning as <S4 object of class "ExpressionSet"> instead of expression values of the sample. can you please help me how to get expression values

ADD REPLY
3
Entering edit mode

The limma user manual also goes into linear models and is very useful for analysis of illumina microarray data

ADD REPLY
3
Entering edit mode

Very useful. Thanks.

ADD REPLY
2
Entering edit mode

Hi, this was very good tutorial. You said you might do a heatmap.2, which would be great. I am analysing data on this kind of chip and need to do heatmap.2. I would be very grateful to see how you would incorporate heatmap.2 into this tutorial. As I am having difficulty with it.

Thanks for your time

ADD REPLY
2
Entering edit mode

Very good tutorial, thank you

ADD REPLY
2
Entering edit mode

very good tutorial. thanks for sharing

ADD REPLY
1
Entering edit mode

Hi Obi,

I followed the steps you mentioned today and I have a text file now, how can I visualize the data to interpretation please?

ADD REPLY
1
Entering edit mode

Thanks for such a nice tutorial

ADD REPLY
1
Entering edit mode

I found also this It may be helpful http://learn.genetics.utah.edu/content/labs/microarray/

ADD REPLY
1
Entering edit mode

Hi, I want to use limma for obtaining list of highly expressed genes in a Dataset(GEO**).

Will someone help me that how to use R studio for this purpose, the type of input file ,the command step by step, etc till the result.

ADD REPLY
2
Entering edit mode

Search Google 'limma user guide'. You shouldn't ask this as an answer to some other post. You can create a new thread but not for the same question as this is very easy to find with google.

ADD REPLY
1
Entering edit mode

To avoid converting the entire data frame to a "character" matrix, I would use cbind.data.frame() instead of cbind():

#Combine gene annotations with raw data
rma=cbind.data.frame(probes,Symbols,Entrez_IDs,rma)
ADD REPLY
1
Entering edit mode

Can you please tell me how to find the best probeset in microarray analysis?

ADD REPLY
1
Entering edit mode

I followed this but I am getting the following error with deseq2:

dds <- DESeq(dds1)
estimating size factors estimating dispersions gene-wise dispersion estimates mean-dispersion relationship
Error in estimateDispersionsFit(object, fitType = fitType, quiet = quiet) :
all gene-wise dispersion estimates are within 2 orders of magnitude   from the minimum value, and so the standard curve fitting techniques will not work.
One can instead use the gene-wise estimates as final estimates:
dds <- estimateDispersionsGeneEst(dds)
dispersions(dds) <- mcols(dds)$dispGeneEst
...then continue with testing using nbinomWaldTest or nbinomLRT
ADD REPLY
0
Entering edit mode

This is very useful,but I have a problem in Describing the experiment step,when I am using command.

$ ls data/*.CEL > data/phenodata.txt

This returns an error

$ ls data/*.CEL > data/phenodata.txt
Error: unexpected '$' in "$"

What should I do now?

ADD REPLY
0
Entering edit mode

$ is your terminal's command prompt end. Don't use $ and try.

ADD REPLY
0
Entering edit mode

Thank you for a reply. Actually I do not work with linux. Do you know same of this tutorial for windows?

ADD REPLY
0
Entering edit mode

Hi,

I tried to run this command

data.rma.norm=rma(raw.data)

but unfortunately I faced this error, would you please help me. my platform is GPL96

Error in getCdfInfo(object) : 
  Could not obtain CDF environment, problems encountered:
Specified environment does not contain HGU133A_HS_ENTREZG,HGU133B_HS_ENTREZG
Library - package hgu133ahsentrezg,hgu133bhsentrezgcdf not installed
Bioconductor - hgu133ahsentrezg,hgu133bhsentrezgcdf not available
ADD REPLY
0
Entering edit mode

For RMA normalization

library(affy)

# To read all CEL files in the working directory:
Data<-ReadAffy()

eset<-rma(Data)
norm.data<-exprs(eset)
ADD REPLY
0
Entering edit mode

Hello, I would like to ask is it possible to use similar functions to work with Agilent data? If not, then can someone tell me where can I find something similar to work with Agilent data.

Thank you

ADD REPLY
2
Entering edit mode

Hi Daniel, yes, you can analyse Agilent microarrays. The architecture of the Agilent arrays is fundamentally different than others, though, so the processing is different too. Here is some sample code for you, which should assist:

source("http://bioconductor.org/biocLite.R")
options(scipen=999)
require("limma")

#Read in the data into a dataframe
#readTargets will by default look for the 'FileName' column in the spcified file
targetinfo <- readTargets("Targets.txt", sep="\t")

#Converts the data to a RGList (two-colour [red-green] array), with values for R, Rg, G, Gb
project <- read.maimages(targetinfo, source="agilent")

#Perform background correction on the fluorescent intensities
#'normexp' is beneficial in that it doesn't result in negative values, meaning no data is lost
project.bgcorrect <- backgroundCorrect(project, method="normexp", offset=16)

#Normalize the data with the 'loess' method
#LOESS performs local regression on subsets of the data, resulting in the generation of a 'regression curve'
project.bgcorrect.norm <- normalizeWithinArrays(project.bgcorrect, method="loess")

#For replicate probes in each sample, replace values with the average
#ID is used to identify the replicates
project.bgcorrect.norm.avg <- avereps(project.bgcorrect.norm, ID=project.bgcorrect.norm$genes$ProbeName)

#Write out the expression values (using CEL file name as headers)
wObject <- project.bgcorrect.norm.avg$M
rownames(wObject) <- project.bgcorrect.norm.avg$genes$GeneName
temp <- temp[order(rownames(wObject), decreasing=FALSE),]
write.table(wObject, "Results/MasterExpression.tsv", sep="\t", quote=FALSE, append=FALSE)

#Summarise over gene names (averaging counts over transcripts)
temp <- data.frame(cbind(data.frame(project.bgcorrect.norm.avg$genes$GeneName), data.frame(project.bgcorrect.norm.avg$M)))
project.bgcorrect.norm.avg.summarised <- aggregate(temp[,2:ncol(temp)], temp[1], mean)
rownames(project.bgcorrect.norm.avg.summarised) <- project.bgcorrect.norm.avg.summarised[,1]
project.bgcorrect.norm.avg.summarised <- project.bgcorrect.norm.avg.summarised[,2:ncol(project.bgcorrect.norm.avg.summarised)]
write.table(project.bgcorrect.norm.avg.summarised, "Results/MasterExpressionAverages.tsv", sep="\t", quote=FALSE, append=FALSE)
ADD REPLY
0
Entering edit mode

Oh my, such a quick response! Thank you very much!

ADD REPLY
1
Entering edit mode

You're welcome! Hope that it helps.

ADD REPLY
0
Entering edit mode

Kevin I need a little bit more help :D, I am newbie at bioinformatics and especially at working with R.

When I use function:

targetinfo <- readTargets("Targets.txt", sep="\t")

I get this error:

Error in read.table(file, header = TRUE, stringsAsFactors = FALSE, sep = sep,  : 
   more columns than column names

I am working with GSE75915 data.

ADD REPLY
2
Entering edit mode

daniel.naumovas: Have you considered using GEO2R which is a built-in analysis application available from NCBI GEO?

ADD REPLY
1
Entering edit mode

I was going to suggest something like that because the files that the group have made available are not recognised as they don't have the expected column names for the 2-colour Agilent array. It looks like they have been processed to a certain point.

I can't make my own data available as it's protected by agreements.

ADD REPLY
0
Entering edit mode

I have tried GEO2R, but you cant do normalization with it :/ Need normalized data to work with WGCNA

ADD REPLY
2
Entering edit mode

Hello again, the data is already normalised.

I download the normalised [edit:] intensities from here: ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE75nnn/GSE75915/matrix/

GSE75915 <- read.csv("GSE75915_series_matrix.txt", skip=76, header=TRUE, sep="\t", stringsAsFactors=FALSE)
GSE75915 <- GSE75915[grep("A_", GSE75915[,1]),] #Indirectly remove control probes
rownames(GSE75915) <- GSE75915[,1]
GSE75915 <- GSE75915[,-1]
dim(GSE75915)
[1] 34137    18
boxplot(GSE75915, pch=".")
hist(data.matrix(GSE75915), breaks=150)

boxplot hist

If you want to annotate these probe IDs, I suggest that you do it manually by downloading the human annotation from Agilent's website, here. Alternatively, take a look at the annotate package in R.

Kevin

ADD REPLY
1
Entering edit mode

I added code markup to your post for increased readability. You can do this by selecting the text and clicking the 101010 button. When you compose or edit a post that button is in your toolbar, see image below:

101010 Button

ADD REPLY
0
Entering edit mode

Hello, my targets file just contains this (tab-delimited):

FileName               WT_KO  Time
SampleFiles/File1.txt  WT     4Wk
SampleFiles/File2.txt  KO     4Dy
SampleFiles/File3.txt  KO     7Dy
SampleFiles/File4.txt  WT     4Wk
..
..

The files (txt or CEL) are in a directory called SampleFiles. I think that the column name has to be FileName, whist the others can be anything (these others are used for limma).

Does that help?

ADD REPLY
0
Entering edit mode

Hmm I cant think which column should be named FileName. Maybe your data is available in GEO db so I could use it as example?

ADD REPLY
0
Entering edit mode

Hello Dear Obi Griffith,

I am beginner using bioconductor, I like so much your workflow is very explanatory, however in the end based on the data that are generated the file rma.txt how do I analyze using the limma pack? Would you have some workflow as good as yours that I can follow?

Note: I am working with two normal states and tumor, how to compare the expression between them?

ADD REPLY
0
Entering edit mode

Hi! Does somebody know if the tutorial from the Bioinformatics KnowledgeBlog is still available somewhere? I cannot open it anymore and I am struggling with the exact same task.. Thanks a lot!

ADD REPLY
0
Entering edit mode

Hello sir,

Could you please suggest to me any R script for microarray miRNA analysis of the Exiqon miRCURY LNA microRNA array dataset?

Thank you.

ADD REPLY

Login before adding your answer.

Traffic: 1610 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6