This is how I would process that particular dataset in R/Bioconductor. It assumes you have a recent version of R installed, and you will have to change the working directories to match your own system.
#Install the core Bioconductor packages, if not already installed
#(recent Bioconductor releases use BiocManager::install() in place of biocLite())
source("http://bioconductor.org/biocLite.R")
biocLite()
# install additional bioconductor libraries, if not already installed
biocLite("GEOquery")
biocLite("affy")
biocLite("gcrma")
biocLite("hugene10stv1cdf")
biocLite("hugene10stv1probe")
biocLite("hugene10stprobeset.db")
biocLite("hugene10sttranscriptcluster.db")
#Load the necessary libraries
library(GEOquery)
library(affy)
library(gcrma)
library(hugene10stv1cdf)
library(hugene10stv1probe)
library(hugene10stprobeset.db)
library(hugene10sttranscriptcluster.db)
#Set working directory for download
setwd("/Users/ogriffit/Dropbox/BioStars")
#Download the CEL file package for this dataset (by GSE - Geo series id)
getGEOSuppFiles("GSE27447")
#Unpack the CEL files
setwd("/Users/ogriffit/Dropbox/BioStars/GSE27447")
untar("GSE27447_RAW.tar", exdir="data")
cels = list.files("data/", pattern = "CEL")
sapply(paste("data", cels, sep="/"), gunzip)
cels = list.files("data/", pattern = "CEL")
setwd("/Users/ogriffit/Dropbox/BioStars/GSE27447/data")
raw.data=ReadAffy(verbose=TRUE, filenames=cels, cdfname="hugene10stv1") #From bioconductor
#perform RMA normalization (I would normally use GCRMA but it did not work with this chip)
data.rma.norm=rma(raw.data)
#Get the important stuff out of the data - the expression estimates for each array
rma=exprs(data.rma.norm)
#Format values to 5 decimal places
rma=format(rma, digits=5)
#Map probe sets to gene symbols or other annotations
#To see all available mappings for this platform
ls("package:hugene10stprobeset.db") #Annotations at the exon probeset level
ls("package:hugene10sttranscriptcluster.db") #Annotations at the transcript-cluster level (more gene-centric view)
#Extract probe ids, gene symbols, and Entrez gene ids
probes=row.names(rma)
Symbols = unlist(mget(probes, hugene10sttranscriptclusterSYMBOL, ifnotfound=NA))
Entrez_IDs = unlist(mget(probes, hugene10sttranscriptclusterENTREZID, ifnotfound=NA))
#Combine gene annotations with the RMA-normalized expression values
rma=cbind(probes,Symbols,Entrez_IDs,rma)
#Write RMA-normalized, mapped data to file
write.table(rma, file = "rma.txt", quote = FALSE, sep = "\t", row.names = FALSE, col.names = TRUE)
This produces a tab-delimited text file of the following format. Note that many probes will have "NA" for gene symbol and Entrez ID.
probes Symbols Entrez_IDs GSM678364_B2.CEL GSM678365_B4.CEL GSM678366_B5.CEL ...
7897441 H6PD 9563 6.5943 7.0552 7.5201 ...
7897449 SPSB1 80176 6.9727 7.0281 7.2285 ...
7897460 SLC25A33 84275 7.6659 7.4289 7.9707 ...
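The cbind step above relies on each annotation vector having exactly one entry per probe, in the same order as the probe ids. If a lookup ever returned more than one value for a probe, unlist() would lengthen the vector and the columns would no longer line up. Below is a minimal defensive sketch in base R. The list is a made-up stand-in for what mget() returns, "SPSB1B" is an invented second symbol, and first_hit is a hypothetical helper, so treat this as an illustration rather than part of the workflow itself.

```r
# Toy stand-in for the list returned by mget(); the second entry is a
# hypothetical multi-mapping added only to illustrate the problem.
mapped <- list("7897441" = "H6PD",
               "7897449" = c("SPSB1", "SPSB1B"),
               "7897460" = NA)

# Hypothetical helper: keep the first annotation per probe (NA if none),
# so the result always has exactly one entry per probe.
first_hit <- function(x) {
  if (length(x) == 0) NA_character_ else as.character(x[1])
}
Symbols <- vapply(mapped, first_hit, character(1))

# The annotation vector must line up with the probe ids before cbind().
stopifnot(length(Symbols) == length(mapped))
stopifnot(identical(names(Symbols), names(mapped)))
```

Keeping only the first hit is just one possible policy; dropping ambiguous probes or pasting the symbols together would be equally defensible, depending on the downstream analysis.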
Hey k.nirmalraman. Actually, the link you provided is for the Human Exon 1.0 ST Array. I think the original poster wants this one, for the Human Gene 1.0 ST Array. A Google search strangely returns the wrong thing as the top result, which I suspect is what happened to you. For reference, you can get to this file through GEO as well: from the GEO dataset, look at 'Sample Subsets' and choose one of the samples, click on the Platform ID, then follow the provided 'Web link' to Affymetrix's site. You will need to register to download it. It looks like this is the same r3 CDF that they obtained from aroma, though, so either should work. Possibly this is a MATLAB issue?
When we open a CEL file in MATLAB, it creates a structure with a field ChipType, which contains the name of the CDF file as a string. So when we provide the actual CDF file to open alongside the CEL file, that CDF file's name should match the ChipType string (that's what I think). When I removed ",r3" from the end of the unsupported CDF file I got from aroma-project, MATLAB didn't show any warning. But that's not the main issue. I want to know: can I use this unsupported file? Is it the same file, and how can I be sure? Here is an image of the MATLAB CEL file structure: http://i46.tinypic.com/2me8y8l.jpg
Hi, I think your work with Affymetrix data will be easier if you understand the difference between 'chip type' and 'chip definition file (CDF)', cf. http://aroma-project.org/definitions/chipTypesAndCDFs
Unless your software truly prevents you, it is best to avoid renaming your CDFs. For example, what if you have two different versions of the CDF for the same chip type and you are forced to rename them the way you suggest? How would you be able to distinguish them afterward?
The HuGene-1_0-st-v1,r3.cdf provided via the Aroma Project is a one-to-one binary version of the ASCII-version that Affymetrix provides. What Obi says about the term "unsupported" is correct.
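To illustrate the naming point without renaming anything on disk, here is a rough base R sketch. It assumes the file name has the form chip type, then optional comma-separated tags, then the extension, as in HuGene-1_0-st-v1,r3.cdf. The function name chip_type_from_cdf is made up for this example and is not part of aroma or any other package, and the ChipType value is typed in by hand rather than read from a real CEL file.

```r
# Hypothetical helper: derive the chip type from a CDF file name by dropping
# the .cdf extension and any comma-separated tags (e.g. ",r3").
chip_type_from_cdf <- function(cdf_file) {
  base <- sub("\\.[cC][dD][fF]$", "", basename(cdf_file))
  sub(",.*$", "", base)
}

# Value assumed to be what the CEL header reports in its ChipType field.
cel_chip_type <- "HuGene-1_0-st-v1"

# Compare the chip type implied by the file name with the CEL header value.
cdf_chip_type <- chip_type_from_cdf("HuGene-1_0-st-v1,r3.cdf")
matches <- identical(cdf_chip_type, cel_chip_type)
print(matches)
```

A check along these lines lets the ",r3" tag stay in the file name, so different revisions of the CDF for the same chip type remain distinguishable.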
I imagine it is the same file, and yes, I think it would be reasonable to use it. I think Affymetrix only calls it unsupported because they would now prefer (and support) the PLIER-compatible files for use with their own software. If it were me, I would do all this in R/Bioconductor. I will give you an example workflow.
Thanks for the reply. Actually, I got one CDF file from here: http://www.aroma-project.org/chipTypes/HuGene-1_0-st-v1 but it is revision #3 (HuGene-1_0-st-v1,r3.cdf), or at least that's what I understand from its name. Maybe it's the same file, or maybe it's not the exact file with which that experiment was done. When I opened it in MATLAB, MATLAB showed a warning that the CDF file name given in the CEL file and this CDF file's name are not the same. I also edited its name and removed ",r3", and then everything went perfectly; I got as far as gene filtering. I checked the Affymetrix website and couldn't find any CDF file named HuGene-1_0-st-v1.cdf there. One more thing: what is this Affymetrix BED file?