Hi,
I'm new to microarray experiments: I have no prior experience and am trying to get a grip on them. I am using GDS596 (the well-known Su et al. 2004 PNAS data) and trying to get a single expression value for each gene. Essentially, I am looking to replicate the analysis done in Tartaglia et al., "Life on the Edge...", Trends Biochem Sci 32(5), 2007.
I have obtained the raw human *.CEL files and would like some clarification on the steps taken. My questions are below.
1.) MAS5 normalization (for background correction, via the R affy package), then take log10 of these values and average across genes and across experiments. Fine (rma and gcrma can also be used).
2.) The authors then apply "median scale followed by quantile normalization". So, scaling across experiments (i.e. GSM columns) allows us to make comparisons between experiments. Fine. I don't scale row-wise as some other papers do, though (I'm not sure why you would do this).
3.) Then, quantile normalization? Why is this step taken? I had thought that this was done at the probe level. If intensities are normalized (MAS5), and corrected for across experiments (median scaling), why another normalization?
I find 'ok' correlation (Pearson's r ~0.77) with the paper's expression values after the first 2 steps, but then quantile normalization screws everything up. Are there obvious things I'm doing wrong here?
Thanks
greg
One small comment. People typically take log2 of microarray data. I've rarely (if ever) seen log10. I think gcrma and rma already output log2 data. I'm not sure about mas5. So, you could try various combinations of not logging, or log2 instead of log10, to see if you get better correlation with the paper's expression values. I typically just process CEL files with gcrma and don't do an additional median scaling and quantile normalization of that data. GCRMA already includes a quantile normalization. Summarizing to the gene level is a separate issue and will depend on which chip you have.
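To make the two steps from the question concrete, here is a minimal Python sketch of median scaling followed by quantile normalization on a toy matrix. It is only an illustration of the general technique, not the paper's exact procedure or what gcrma does internally. The function names and the toy values are made up for this example. The median scaling is multiplicative, which assumes unlogged intensities, and tied values are not averaged the way some implementations do.

```python
from statistics import mean, median


def median_scale(columns):
    """Rescale each array (column) so that all arrays share one median.

    Assumes unlogged, positive intensities; the common target is the
    mean of the per-array medians (an arbitrary choice for this sketch).
    """
    medians = [median(col) for col in columns]
    target = mean(medians)
    return [[v * target / m for v in col] for col, m in zip(columns, medians)]


def quantile_normalize(columns):
    """Force every array (column) to have the same set of values.

    The value at each rank is replaced by the mean, across arrays,
    of the values at that rank. Ties are broken by position.
    """
    n = len(columns[0])
    sorted_cols = [sorted(col) for col in columns]
    reference = [mean([col[i] for col in sorted_cols]) for i in range(n)]
    normalized = []
    for col in columns:
        order = sorted(range(n), key=lambda i: col[i])  # probe indices by rank
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = reference[rank]
        normalized.append(out)
    return normalized


# Toy data: 3 arrays (GSM columns), 4 probes each; values are invented.
arrays = [
    [5.0, 2.0, 3.0, 4.0],
    [4.0, 1.0, 4.5, 2.0],
    [3.0, 4.0, 6.0, 8.0],
]

scaled = median_scale(arrays)
normalized = quantile_normalize(scaled)
```

After median scaling, the arrays agree only in their medians. After quantile normalization, every array contains exactly the same values and only the ordering of probes within each array differs, which is why it changes the data much more than scaling alone.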
What is a possible solution if you apply GCRMA but only get 3 genes with lfc greater than 1? Is it possible to read CEL files but apply only a log2 transformation, as used by the GEO2R analysis at NCBI GEO? Using the GEO2R approach on the exact same samples gives me all of the top 250 genes above lfc=1.
Please ask a separate question rather than asking a question as a comment to a post.
@Sean: Check this one: Microarray analysis of CEL files with Log-transformation instead of GCRMA or RMA
The base of the log does not affect something like a correlation or statistics related to differential expression. log2(x) and log10(x) simply differ by a scale constant: log2(x) = 3.32193 * log10(x). The log2 and log10 distributions are therefore identical, except scaled by a constant.
Mathematically this is correct, but log2 is convenient because many people find it easier to think about doubled values rather than powers of 10.
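The scale-constant point can be checked numerically. Below is a small Python sketch with invented intensity values and a hand-written Pearson correlation (the `pearson` helper and the two sample vectors are assumptions for illustration only).

```python
import math


def pearson(xs, ys):
    """Plain Pearson correlation coefficient for two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


# Invented unlogged intensities for the same 5 probes in two datasets.
a = [120.0, 850.0, 43.0, 5100.0, 310.0]
b = [150.0, 700.0, 60.0, 4200.0, 280.0]

r_log2 = pearson([math.log2(v) for v in a], [math.log2(v) for v in b])
r_log10 = pearson([math.log10(v) for v in a], [math.log10(v) for v in b])

# The constant relating the two bases: log2(x) = log2(10) * log10(x).
scale = math.log2(10)

# A doubling is exactly 1 unit in log2 but about 0.301 units in log10.
doubling_log2 = math.log2(2)
doubling_log10 = math.log10(2)
```

Here `r_log2` and `r_log10` agree up to floating-point error. What does depend on the base is any fixed cutoff on a log fold change: a cutoff of 1 means a 2-fold change on log2 data but a 10-fold change on log10 data.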
Well, I'm just going to use RMA instead and not rely on the other publication's protocol.