I was thinking about creating a tutorial on how to do a simple microarray analysis in Bioconductor. But, I realized this has already been done quite nicely at the Bioinformatics Knowledgeblog. Their first tutorial on the subject covers installation of necessary packages, downloading of cel files, describing the experiment, loading and normalizing data, quality controls, probe set filtering, finding differentially expressed probe sets, and finally annotating those probe sets to gene symbols. They have follow-up tutorials on visualizing microarray data with volcano plots and assessing the biological significance of results. All three seem quite excellent after a brief perusal.
I might have also included: filtering with 'genefilter', multiple testing correction with 'multtest', and visualizations with 'heatmap.2'. Let me know if you would like to see tutorials specifically on any of those topics.
UPDATE: Since this seems to be coming up with some regularity I decided to add some sample code for importing CEL files, normalizing, and annotating to Entrez Gene ID and Symbol, even though this overlaps largely with the tutorial linked above. It illustrates a few slightly different things and also gives a concrete example for the 'Affymetrix Human Gene 1.0 ST Array'.
#install the core bioconductor packages, if not already installed
source("http://bioconductor.org/biocLite.R")
biocLite()
# install additional bioconductor libraries, if not already installed
biocLite("GEOquery")
biocLite("affy")
biocLite("gcrma")
biocLite("hugene10stv1cdf")
biocLite("hugene10stv1probe")
biocLite("hugene10stprobeset.db")
biocLite("hugene10sttranscriptcluster.db")
#Load the necessary libraries
library(GEOquery)
library(affy)
library(gcrma)
library(hugene10stv1cdf)
library(hugene10stv1probe)
library(hugene10stprobeset.db)
library(hugene10sttranscriptcluster.db)
#Set working directory for download
setwd("/Users/ogriffit/Dropbox/BioStars")
#Download the CEL file package for this dataset (by GSE - Geo series id)
getGEOSuppFiles("GSE27447")
#Unpack the CEL files
setwd("/Users/ogriffit/Dropbox/BioStars/GSE27447")
untar("GSE27447_RAW.tar", exdir="data")
cels = list.files("data/", pattern = "CEL")
sapply(paste("data", cels, sep="/"), gunzip)
cels = list.files("data/", pattern = "CEL")
setwd("/Users/ogriffit/Dropbox/BioStars/GSE27447/data")
raw.data=ReadAffy(verbose=TRUE, filenames=cels, cdfname="hugene10stv1") #From bioconductor
#perform RMA normalization (I would normally use GCRMA but it did not work with this chip)
data.rma.norm=rma(raw.data)
#Get the important stuff out of the data - the expression estimates for each array
rma=exprs(data.rma.norm)
#Format values to 5 decimal places
rma=format(rma, digits=5)
#Map probe sets to gene symbols or other annotations
#To see all available mappings for this platform
ls("package:hugene10stprobeset.db") #Annotations at the exon probeset level
ls("package:hugene10sttranscriptcluster.db") #Annotations at the transcript-cluster level (more gene-centric view)
#Extract probe ids, entrez symbols, and entrez ids
probes=row.names(rma)
Symbols = unlist(mget(probes, hugene10sttranscriptclusterSYMBOL, ifnotfound=NA))
Entrez_IDs = unlist(mget(probes, hugene10sttranscriptclusterENTREZID, ifnotfound=NA))
#Combine gene annotations with raw data
rma=cbind(probes,Symbols,Entrez_IDs,rma)
#Write RMA-normalized, mapped data to file
write.table(rma, file = "rma.txt", quote = FALSE, sep = "\t", row.names = FALSE, col.names = TRUE)
This produces a tab-delimited text file of the following format. Note that many probes will have "NA" for gene symbol and Entrez ID.
probes Symbols Entrez_IDs GSM678364_B2.CEL GSM678365_B4.CEL GSM678366_B5.CEL ...
7897441 H6PD 9563 6.5943 7.0552 7.5201 ...
7897449 SPSB1 80176 6.9727 7.0281 7.2285 ...
7897460 SLC25A33 84275 7.6659 7.4289 7.9707 ...
I wanted to perform WGCNA analysis on Affy cell files. So according to this tutorial, the "rma.txt" could be used as the input for WGCNA ?
I've not used the WGCNA R package. But, if it starts with a gene expression matrix as input, then yes, it seems like you could use the RMA normalized gene expression values for your analysis.
Thanks..yea, the first column has the list of identifiers followed by the expression profiles of all the samples.
Hi, my computer gets stuck every time I use the annotation package (the cbind command). Is there any way to overcome that? I have also tried allotting heavy memory on our computing cluster and yet the problem is unsolved
This so helpful! I've been trying to figure out all these for months.. Thank you!
No problem. Although, thanks should go to the good people at the Bioinformatics Knowledgeblog who created these tutorials. P.S., for future reference, comments/thanks like this would be better as a "comment" to the main post rather than as an "answer".
Dear Obi,
I have normalized my affymetrix microarray data and did gene filter to remove probes that do not have entrez id's.
Now I want to annotate my data for gene symbols. I've followed your steps to retrieve gene symbols. After writing the data in text file I found a column mentioning as
<S4 object of class "ExpressionSet">
instead of expression values of the sample. can you please help me how to get expression valuesThe limma user manual also goes into linear models and is very useful for analysis of illumina microarray data
Very useful. Thanks.
Hi, this was very good tutorial. You said you might do a
heatmap.2
, which would be great. I am analysing data on this kind of chip and need to doheatmap.2
. I would be very grateful to see how you would incorporateheatmap.2
into this tutorial. As I am having difficulty with it.Thanks for your time
Very good tutorial, thank you
very good tutorial. thanks for sharing
Hi Obi,
I followed the steps you mentioned today and I have a text file now, how can I visualize the data to interpretation please?
Thanks for such a nice tutorial
I found also this It may be helpful http://learn.genetics.utah.edu/content/labs/microarray/
Hi, I want to use limma for obtaining list of highly expressed genes in a Dataset(GEO**).
Will someone help me that how to use R studio for this purpose, the type of input file ,the command step by step, etc till the result.
Search Google 'limma user guide'. You shouldn't ask this as an answer to some other post. You can create a new thread but not for the same question as this is very easy to find with google.
http://bioinformatics.knowledgeblog.org/2011/06/20/analysing-microarray-data-in-bioconductor/
To avoid converting the entire data frame to a "character" matrix, I would use
cbind.data.frame()
instead ofcbind()
:Can you please tell me how to find the best probeset in microarray analysis?
I followed this but I am getting the following error with
deseq2
:This is very useful,but I have a problem in Describing the experiment step,when I am using command.
This returns an error
What should I do now?
$
is your terminal's command prompt end. Don't use$
and try.Thank you for a reply. Actually I do not work with linux. Do you know same of this tutorial for windows?
Hi,
I tried to run this command
but unfortunately I faced this error, would you please help me. my platform is GPL96
For RMA normalization
Hello, I would like to ask is it possible to use similar functions to work with Agilent data? If not, then can someone tell me where can I find something similar to work with Agilent data.
Thank you
Hi Daniel, yes, you can analyse Agilent microarrays. The architecture of the Agilent arrays is fundamentally different than others, though, so the processing is different too. Here is some sample code for you, which should assist:
Oh my, such a quick response! Thank you very much!
You're welcome! Hope that it helps.
Kevin I need a little bit more help :D, I am newbie at bioinformatics and especially at working with R.
When I use function:
I get this error:
I am working with GSE75915 data.
daniel.naumovas: Have you considered using
GEO2R
which is a built-in analysis application available from NCBI GEO?I was going to suggest something like that because the files that the group have made available are not recognised as they don't have the expected column names for the 2-colour Agilent array. It looks like they have been processed to a certain point.
I can't make my own data available as it's protected by agreements.
I have tried GEO2R, but you cant do normalization with it :/ Need normalized data to work with WGCNA
Hello again, the data is already normalised.
I download the normalised [edit:] intensities from here: ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE75nnn/GSE75915/matrix/
If you want to annotate these probe IDs, I suggest that you do it manually by downloading the human annotation from Agilent's website, here. Alternatively, take a look at the annotate package in R.
Kevin
I added code markup to your post for increased readability. You can do this by selecting the text and clicking the 101010 button. When you compose or edit a post that button is in your toolbar, see image below:
Hello, my targets file just contains this (tab-delimited):
The files (txt or CEL) are in a directory called
SampleFiles
. I think that the column name has to beFileName
, whist the others can be anything (these others are used forlimma
).Does that help?
Hmm I cant think which column should be named FileName. Maybe your data is available in GEO db so I could use it as example?
Hello Dear Obi Griffith,
I am beginner using bioconductor, I like so much your workflow is very explanatory, however in the end based on the data that are generated the file
rma.txt
how do I analyze using the limma pack? Would you have some workflow as good as yours that I can follow?Note: I am working with two normal states and tumor, how to compare the expression between them?
Hi! Does somebody know if the tutorial from the Bioinformatics KnowledgeBlog is still available somewhere? I cannot open it anymore and I am struggling with the exact same task.. Thanks a lot!
Hello sir,
Could you please suggest to me any R script for microarray miRNA analysis of the Exiqon miRCURY LNA microRNA array dataset?
Thank you.