Question

Normalizing raw data from Affymetrix expression microarrays

0

5.8 years ago

Natasha ▴ 40

Is there a Python package that can be used for normalizing data from Affymetrix expression microarrays?

I understand that the affy package from Bioconductor can be used in R. However, I would like to perform the normalization with Python libraries.

I found pyaffy on GitHub, but I ran into installation problems on Windows.

Any suggestions for other Python libraries?

gene-expression CEL rma • 2.6k views

ADD COMMENT • link updated 5.8 years ago by ATpoint 89k • written 5.8 years ago by Natasha ▴ 40

score 1 · Answer 1 · 2019-10-16

1

5.8 years ago

ATpoint 89k

I strongly recommend using established Bioconductor packages. Normalization alone will not be enough, because you need a differential analysis to get the relevant fold changes. With microarrays, only the change between conditions is relevant, and for that you will need tools like limma anyway. The normalized intensities per condition are not informative on their own, so do not use them for clustering; use the log2 fold changes that limma outputs.

ADD COMMENT • link 5.8 years ago by ATpoint 89k

0

Many thanks for the response. I am using Bioconductor packages now.

I would like to confirm whether the following commands are correct for assigning the CDF file for normalization. GEO lists 3 platforms for this series, so I am a little confused. GEO: GSE1133

biocLite("hgu133acdf")
raw.data = ReadAffy(verbose = FALSE, filenames = cels, cdfname = "hgu133acdf")

ADD REPLY • link 5.8 years ago by Natasha ▴ 40

0

As I told you in a previous post, there are three kinds of samples, corresponding to three platforms, distinguished by whether the sample names start with 1B, 3A or MG. This superset contains both mouse and human data. Of course, you can only compare samples from the same platform with each other. Check which platform corresponds to each of the three sample groups by clicking the entries and then looking for the GPL identifier.
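For readers who, as in the original question, prefer Python: the grouping step described above can be sketched as follows. This only splits sample names by the three prefixes mentioned in the reply. The file names are hypothetical placeholders, not real GSE1133 file names, and the mapping from each prefix to a GPL identifier still has to be looked up on GEO.

```python
from collections import defaultdict

# Prefixes named in the reply above; which GPL each one belongs to
# must still be checked on GEO and is deliberately not encoded here.
PREFIXES = ("1B", "3A", "MG")


def group_by_prefix(names, prefixes=PREFIXES):
    """Group sample names by the first matching prefix."""
    groups = defaultdict(list)
    for name in names:
        key = next((p for p in prefixes if name.startswith(p)), "unassigned")
        groups[key].append(name)
    return dict(groups)


# Hypothetical file names, for illustration only.
samples = ["1B_sample_a.CEL", "3A_sample_b.CEL", "MG_sample_c.CEL",
           "1B_sample_d.CEL", "other_sample.CEL"]
groups = group_by_prefix(samples)
for prefix, members in groups.items():
    print(prefix, members)
```

Each group would then be normalized and analyzed separately, since samples from different platforms cannot be compared with each other.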

ADD REPLY • link 5.8 years ago by ATpoint 89k

0

Thanks. Yes, I found the 3 platforms: GPL96, GPL1073 and GPL1074.

I could also get the corresponding gsets by doing

library(Biobase)
library(GEOquery)
library(limma)

# load series and platform data from GEO

gset <- getGEO("GSE1133", GSEMatrix = TRUE, AnnotGPL = TRUE)

# pick the ExpressionSet for one platform; likewise I can filter for other platforms
if (length(gset) > 1) {
  idx <- grep("GPL96", attr(gset, "names"))
} else {
  idx <- 1
}
gset <- gset[[idx]]

phenoData(gset) gives sample names.

I am not sure how to proceed after the above step.

In the following procedure described in the supplementary file, RMA normalization is done before clustering.

We used Affymetrix microarray data from a recent thorough analysis of the mouse and human transcriptomes [1]. We selected all 54 adult mouse non-cancer samples. The raw intensity data were transformed to normalized expression levels with the robust multi-array average (RMA) low-level algorithm [2] implemented in the BioConductor package [3]. We used standard settings, including perfect match (PM) only, model-based background and quantile normalization across experiments [4]. Similar results were obtained using the microarray analysis suite (MAS5) function followed by log-transformation to calculate expression levels (data not shown).

Expression distances, tree reconstruction and bootstrap analysis

We calculated Euclidean distances between tissue expression vectors, with each dimension corresponding to one gene. Except for the replicate analysis in Figure S1, distances were calculated after averaging expression values across replicates. Trees were constructed from these distances using neighbor joining as implemented in MEGA2 [7]. Similar results were obtained using squared Euclidean distances (data not shown).
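To make the quoted steps concrete for anyone reading along in Python, here is a minimal sketch of three of them: quantile normalization across arrays, averaging replicates per tissue, and Euclidean distance between tissue vectors. It is not RMA (there is no background correction and no probe-set summarization), ties are handled naively, and the toy values, tissue labels and function names are all made up for illustration.

```python
import math
from collections import defaultdict


def quantile_normalize(columns):
    """Give every array (column) the same empirical distribution.

    `columns` is a list of equal-length lists, one per array. Ties are
    broken by position rather than averaged, which is a simplification.
    """
    n_probes = len(columns[0])
    sorted_cols = [sorted(col) for col in columns]
    # Reference distribution: mean of the k-th smallest value across arrays.
    reference = [
        sum(col[k] for col in sorted_cols) / len(columns)
        for k in range(n_probes)
    ]
    normalized = []
    for col in columns:
        # Rank each probe within its array, then substitute the reference value.
        order = sorted(range(n_probes), key=lambda i: col[i])
        out = [0.0] * n_probes
        for rank, probe in enumerate(order):
            out[probe] = reference[rank]
        normalized.append(out)
    return normalized


def average_replicates(columns, tissues):
    """Average arrays that share a tissue label; returns {tissue: vector}."""
    groups = defaultdict(list)
    for col, tissue in zip(columns, tissues):
        groups[tissue].append(col)
    return {
        tissue: [sum(vals) / len(vals) for vals in zip(*cols)]
        for tissue, cols in groups.items()
    }


def euclidean(u, v):
    """Euclidean distance, one dimension per gene."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


# Toy log2-scale values: 3 arrays x 4 probes (two liver replicates, one brain).
arrays = [
    [5.0, 2.0, 3.0, 4.0],  # hypothetical liver replicate 1
    [4.0, 1.0, 4.0, 2.0],  # hypothetical liver replicate 2
    [3.0, 4.0, 6.0, 8.0],  # hypothetical brain sample
]
tissues = ["liver", "liver", "brain"]

normalized = quantile_normalize(arrays)
profiles = average_replicates(normalized, tissues)
distance = euclidean(profiles["liver"], profiles["brain"])
print(round(distance, 4))
```

After normalization, every array contains the same set of values in a different order, which is what makes the arrays comparable before distances are computed. A distance matrix built this way for all tissue pairs is what a neighbor-joining program would take as input.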

From your response in the previous post, I understand log2 transformation has to be performed.

# log2 transform
ex <- exprs(gset)

Could you please explain whether the log2 transformation has to be performed on the normalized data?
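One point that bears on this question: RMA as implemented in the affy package returns values that are already on the log2 scale, whereas MAS5 values are on the linear scale, which is why the quoted methods log-transform after MAS5. A rough way to check which scale a set of values is on, sketched in Python. The threshold of 20 and the toy values are assumptions for illustration, not a rule taken from GEO or Bioconductor.

```python
import math


def looks_log2_scaled(values, max_plausible_log2=20.0):
    """Heuristic: log2 intensities from 16-bit scanners stay below about 16,
    so a maximum above `max_plausible_log2` suggests linear-scale data."""
    return max(values) <= max_plausible_log2


def log2_transform(values):
    """log2 of strictly positive values; non-positive entries become NaN."""
    return [math.log2(v) if v > 0 else float("nan") for v in values]


linear = [120.5, 980.0, 15432.0, 64.0]  # made-up linear-scale intensities
already_log = [6.9, 9.9, 13.9, 6.0]     # made-up log2-scale values

for name, vals in (("linear", linear), ("already_log", already_log)):
    if not looks_log2_scaled(vals):
        vals = log2_transform(vals)
    print(name, [round(v, 2) for v in vals])
```

Applying log2 a second time to values that are already log-scaled would compress them badly, so a check like this is worth doing before transforming a matrix downloaded from GEO.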

Also, RMA normalization requires cdfname among the arguments that are passed, and I am not sure how to specify the CDF files for the 3 platforms. For GPL96, the CDF package can be loaded using biocLite("hgu133acdf").

For the other 2 platforms, GPL1073 and GPL1074, I could locate the CDF files in the GEO custom downloads (GSE1133_RAW.tar), but I don't know how to load these CDF files from biocLite.

ADD REPLY • link 5.8 years ago by Natasha ▴ 40