I'm following the suggestions given here to parse the values of transcript array signals. I'm quite confused by the output of the command head(exprs(gse[[1]])).
Does it give the value in terms of log2 of the transcript array signal or the transcript array signal itself?
For example,
gse <- getGEO("GSE30732", GSEMatrix=TRUE)
head(exprs(gse[[1]]))
GSM762810 GSM762811 GSM762812 GSM762813 GSM762814 GSM762815 GSM762816
7892501 144.1970 154.9120 204.9800 170.7820 160.4320 177.0040 164.8360
7892502 87.3393 66.1895 236.0300 67.5643 118.0320 110.5290 124.1790
7892503 17.6620 25.2553 25.0095 35.4209 34.8682 16.8056 19.0721
7892504 968.6140 552.2480 629.9800 877.5390 1119.0500 395.1390 657.9350
7892505 48.9950 56.7481 42.8267 41.8391 42.8017 52.4028 40.8999
7892506 17.0715 11.3087 87.3541 18.2260 33.5276 33.4526 36.1377
GSM762817 GSM762818 GSM762819 GSM762820
7892501 160.8070 192.1560 97.9872 256.8820
7892502 93.8661 125.8370 109.0660 131.7860
7892503 19.9689 24.9181 32.5494 31.5395
7892504 934.4180 851.7840 803.8410 736.0790
7892505 58.3426 46.8862 49.7869 50.8334
7892506 29.2909 34.7242 35.1665 33.4917
Whereas, for
gs <- getGEO("GSE3984",GSEMatrix=TRUE)
> head(exprs(gs[[1]]))
GSM90867 GSM90868 GSM90869 GSM90870 GSM90871 GSM90872
244901_at 6.653998 7.054338 6.806130 6.829245 7.027696 7.344291
244902_at 5.288584 6.662941 6.267735 5.781530 6.039393 6.848940
244903_at 6.400026 7.275735 7.289969 6.682002 7.000932 7.482286
244904_at 6.089982 6.441873 6.638312 5.969956 6.633428 6.733617
244905_at 6.187861 6.272919 6.517373 6.599013 6.209060 6.415216
244906_at 7.462396 7.822089 7.852994 7.329415 7.663500 8.192281
The above output is indexed by probe IDs, and the values appear to be log2 values.
How can we parse the value of the transcript array signal and not the log2 values?
Many thanks
Hi Kevin, many thanks for the response. Excuse me for the naive question; I'm a beginner in this field. Could you please explain what it exactly means when we say the data is "normalized"? Does the expression value being given on a log2 scale mean the data is normalized?
This syntax doesn't allow me to use a command like exprs(gse[[1]]). I obtain the error "this S4 class is not subsettable", the same error shown in the question posted in the link you shared. I understand this command can be used for series matrix files: exprs(gse[[1]]) works for parsing the log2-scale values from series matrix files, and Table(gds) works for parsing values from GDS files. How do we parse normalised (unlogged) data from SOFT files? Could you please provide the syntax that has to be used to get the unlogged data? I wish to obtain the gene names and cell type description too (for which you suggested pData(ESET[[1]]) in my previous post). What is the difference between normalised and unnormalised data? To compare data from multiple studies, is it recommended to use the normalised data, or should one use unnormalised data?
Is there any tutorial on how to normalize data from CEL files?
I had a chance to use GEO2R before; since I am trying to analyze data from many different experimental studies, I am trying to write code to automate the analysis.
Many thanks
Hey, oh, if you want a good introduction to microarrays, then this is a must read (the author is also a very decent guy): Microarray data normalization and transformation.
You could consider that there are 3 types of microarray expression values:
raw intensities
These are stored in CEL, TXT, or other raw data files (depending on the platform) and are literally measures of fluorescence. That is, on the actual microarray chip, when a fluorescently-labelled target hybridises with its probe, the label emits light that is picked up by a detector, which encodes this fluorescence numerically, giving the raw intensities.
Here is a chip image from one of my own previous experiments:
The part in the middle is the sample loading point. In the corners (and center, by the looks of it), there are control probes.
Chip designs and detector sensitivities change from platform / version to platform / version; thus, the range of raw data fluorescent intensities could differ a lot between, for example, an Agilent and Affymetrix microarray.
There is 1 sample per chip.
normalised intensities
In a typical experiment we will have multiple samples, and thus there may exist experimental bias across samples. On each individual chip, as the chips are literally giving off light, there will also always exist background 'noise'. The normalisation process aims to tackle these types of biases. Probes also differ in how they hybridise to their target cDNA, because AT and GC base pairs require different binding energies, i.e., GC bias.
For microarrays, the most common form of normalisation is gcRMA (robust multiarray average with GC correction):
Others include MAS5 and RMA.
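As a rough sketch of how these normalisation methods are applied in practice (this assumes Bioconductor's affy and gcrma packages are installed and that the CEL files sit in the working directory; it is illustrative, not the exact pipeline used for this dataset):

```r
# Sketch only: requires Bioconductor's 'affy' and 'gcrma' packages and a
# directory of *.CEL files. Not runnable without those files.
library(affy)    # provides ReadAffy(), rma(), mas5()
library(gcrma)   # provides gcrma()

raw  <- ReadAffy()       # reads all *.CEL files in the working directory
eset <- gcrma(raw)       # GC-corrected RMA: background-correct,
                         # quantile-normalise, summarise, log2-transform
# eset_rma  <- rma(raw)  # plain RMA alternative (also log2-scaled)
# eset_mas5 <- mas5(raw) # MAS5 alternative (NOT log2-scaled by default)

head(exprs(eset))        # normalised, log2 expression matrix
```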
So, the log2 transformation is typically the final part of the normalisation process. Technically, you just need to do
2^(exprs(gseEset[[1]]))
in order to reverse the log2 transformation, which should then give you the normalised, unlogged data.
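To see that this round-trips, here is a minimal self-contained sketch with simulated values (no GEO download; the toy matrix stands in for exprs(gseEset[[1]])):

```r
# Toy stand-in for exprs(gseEset[[1]]): a small matrix of log2 intensities
set.seed(1)
logged <- matrix(log2(runif(12, min = 20, max = 1000)),
                 nrow = 4, ncol = 3,
                 dimnames = list(paste0("probe", 1:4), paste0("GSM", 1:3)))

unlogged <- 2^logged   # reverse the log2 transformation

# Round trip: log2() of the unlogged data recovers the original values
stopifnot(isTRUE(all.equal(log2(unlogged), logged)))

# A quick heuristic: log2 microarray data typically spans roughly 0-16,
# whereas unlogged intensities span hundreds to tens of thousands
range(logged)
range(unlogged)
```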
I don't actually know of anyone who routinely uses the SOFT files. You can get a better idea of what's in them by running:
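The snippet itself did not survive here; as a hedged reconstruction, the standard GEOquery accessors for SOFT-format files look like this (the function names are real GEOquery API, but the exact call Kevin ran is an assumption; requires network access):

```r
# Hedged reconstruction of the missing snippet: exploring a SOFT file
# with GEOquery's standard accessors. Requires network access.
library(GEOquery)

gse <- getGEO("GSE30732", GSEMatrix = FALSE)   # parse the SOFT file
Meta(gse)$summary                  # series-level metadata
names(GSMList(gse))                # one entry per sample (GSM)
head(Table(GSMList(gse)[[1]]))     # per-sample data table: ID_REF + VALUE
head(Table(GPLList(gse)[[1]]))     # platform annotation (probe -> gene)
```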
I just checked for this particular experiment, and the SOFT and the Series Matrix files both contain the normalised, log2 data.
Is there any particular reason why you require the normalised, unlogged data? The issue with microarray analysis is that different platforms require different types of processing and may have different related R packages.
Hi Kevin, apologies for the delay in my response. It took me some time to understand that what I was doing was wrong.
Thanks a lot for the link to the reference.
I was looking at the normalized, unlogged data to compare the expression values of a list of 30 genes across 10 different experimental studies performed using the same platform, Affy U133 Plus 2. I was under the impression that I could take the normalized values from different studies and make a comparison.
After reading the reference you shared, I understand the normalization procedure (e.g. mean/median) can vary.
I would like to ask you for a few clarifications, if you don't mind.
1. Is the normalized (unlogged or logged) expression value given in these matrices (gseEset <- getGEO("GSE30732", GSEMatrix=TRUE)) an absolute value or a ratio (is this the T_i value, a ratio, shown in the reference)?
2. In a given study with 20 samples (10 disease, 10 control), if one has to compute the average of the expression values reported in, say, the controls, would it be meaningful to compute the mean of the normalized values reported in each sample for a given probe ID? a) I am a little confused about how the normalization is actually done. Is the normalization factor (N_total) the same for all the samples in a study? (If my understanding is correct, it is the same.) Please correct me if I am wrong.
3. If one has to compare the expression value across multiple studies on the same platform, is it possible to combine the raw values and normalize? Is there any tool that can be used for this purpose?
Many thanks for your time and attention; excuse me for the naive questions.
This is actually a very good question and there is much confusion in the community, from what I can see at least. Some array platforms (e.g. 'two-colour' arrays) allow you to probe expression levels of 2 conditions / treatments on each physical microarray chip. In this situation, one colour of light represents one condition, and the other is the other [condition]. The derived expression levels are normalised and logged (base 2) ratios of Condition A intensity / Condition B intensity for each gene target. So, in this situation, the values that we receive are already log2 ratios.
On single colour arrays, which appear to be more common these days, the levels of expression are representative of a single condition and are thus absolute values. The process that I mention in my previous comment was more related to these single-colour arrays. This is in part why one has to carefully navigate the methods for microarray analysis, i.e., because one has to know which platform was used.
This publication explains the fundamental difference between single and two-colour arrays in the opening paragraphs:
"Prediction of clinical endpoints from microarray-based gene-expression measurements can be performed using either of two experimental procedures: (1) a one-color approach, in which a single RNA sample is labeled with a fluorophore (such as phycoerythrin, cyanine-3 (Cy-3) or cyanine-5 (Cy-5)) and hybridized alone to a microarray, or (2) a two-color strategy, in which two samples (usually a sample and a reference) are labelled with different fluorophores (for example, Cy-3 and Cy-5) and are then hybridized together on a single microarray. The resulting data are fundamentally different: while two-color arrays yield ratios of fluorescence intensities (that is, sample fluorescence/reference fluorescence), one-color arrays result in absolute fluorescence intensities, which are assumed to be monotonically (if not linearly) related to the abundance of mRNA species complementary to the probes on the array."
[from: Comparison of performance of one-color and two-color gene-expression analyses in predicting clinical endpoints of neuroblastoma patients]
The array in your study is Affymetrix Human Gene 1.0 ST Array, which is a very popular single colour gene expression array.
To better understand the normalisation method, just look up quantile normalisation, which is by far the most common for microarrays. The logic behind it is quite simple.
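The logic can be written out in a few lines of base R. This is an illustrative sketch only (the function name is made up, and a real analysis would use normalize.quantiles() from the preprocessCore package instead):

```r
# Illustrative quantile normalisation: force every sample (column) to share
# the same distribution of values while preserving each sample's ranks.
quantile_normalise <- function(mat) {
  ranks  <- apply(mat, 2, rank, ties.method = "first")  # rank within each sample
  sorted <- apply(mat, 2, sort)                         # sort each sample
  means  <- rowMeans(sorted)                            # mean of each quantile
  out    <- apply(ranks, 2, function(r) means[r])       # map means back by rank
  dimnames(out) <- dimnames(mat)
  out
}

m <- matrix(c(5, 2, 3, 4,
              4, 1, 4, 2,
              3, 4, 6, 8), nrow = 4)
qn <- quantile_normalise(m)

# After normalisation, every column has identical sorted values
apply(qn, 2, sort)
```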
If we have single colour arrays (10 disease; 10 control), we determine the group means in order to derive log2 ratios on a per-gene basis. We also fit a linear model on a per-gene basis, i.e., perform linear regression. From this linear model 'fit', we can derive a P value. Thus, we will have a log2 ratio and a P value for each gene.
In this case, you can definitely just process the samples together; however, when performing the analysis, you should include a covariate relating to the different studies in the statistical model. Combining data from different platforms is obviously more difficult, but not impossible.
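As a toy base-R illustration of "include a covariate relating to the different studies" (a real analysis would use limma with a design matrix; the simulated numbers here are invented for the example):

```r
# Simulate one gene measured in two studies with a study-specific offset
set.seed(42)
study     <- factor(rep(c("A", "B"), each = 10))
condition <- factor(rep(c("control", "disease"), times = 10))
batch_eff <- ifelse(study == "B", 2, 0)   # study B runs 2 units higher overall
true_lfc  <- 1.5                          # the real disease effect
expr <- 6 + batch_eff + true_lfc * (condition == "disease") + rnorm(20, sd = 0.3)

# With the 'study' covariate in the model, its offset is absorbed and the
# disease coefficient cleanly estimates the true effect.
fit <- lm(expr ~ study + condition)
coef(fit)["conditiondisease"]   # should land close to 1.5
```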