gene co-expression on microarray assembled from GDS datasets is influenced by strong log2 negative outliers
10.0 years ago
grokaine ▴ 40

I need to compute gene co-expression for a compendium of GEO microarrays. I downloaded a number of GDS datasets corresponding to two GPL platforms and merged them into a gene expression table. After log2 transforming them I obtained a lot of negative values. The negative values come from small expression values (between 0 and 1), which the log2 transformation turns into outliers. These outliers influence any type of co-expression measurement. The GDS datasets are supposed to be both background corrected and normalized, but I performed quantile normalization to re-align the probe distributions among datasets. I still have too many negative values, though.

How would you recommend I proceed?

  1. Download the raw .CEL files and perform a unified background correction/normalization? I saw people saying that this improves the overall quality, but I am not convinced. First, these operations are mostly performed to eliminate consistent noise due to specific experimental conditions. Second, negative values are already present in the GDS datasets after all the statistical processing, so what guarantees that I will not end up in the same situation, especially since I will use many different experiments?
  2. Add 1.0 to all expression values before log2 transforming them. This is my favored solution.
  3. Not using any log2 transformation (why is this used anyway?). However this would make outliers even stronger.
  4. ???
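
For readers comparing options 2 and 3, here is a minimal sketch of what the `+1` offset does to values between 0 and 1. The intensities are made up for illustration (not real GEO data), and the function names are hypothetical helpers, not part of any package mentioned in this thread.

```python
import math

def log2_plain(values):
    """Plain log2: inputs in (0, 1) map to negative numbers."""
    return [math.log2(v) for v in values]

def log2_offset(values, offset=1.0):
    """log2(v + offset): with offset=1.0, non-negative inputs map to >= 0."""
    return [math.log2(v + offset) for v in values]

# Made-up intensities for one probe across six samples (illustrative only).
probe = [0.02, 0.5, 120.0, 150.0, 135.0, 0.9]

plain = log2_plain(probe)      # small inputs become strongly negative
shifted = log2_offset(probe)   # small inputs are compressed toward 0

print(min(plain), min(shifted))
```

Note that the offset compresses the low end of the scale rather than removing it, so low-intensity probes are still present after the transform.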
GEO co-expression normalization Microarray • 3.0k views
10.0 years ago

There is a correlation (in the qualitative sense, not in the quantitative sense) between low expression values on an array and variance. This is irrespective of the presence or absence of negative values, so I wouldn't focus on the negative values. Log transformation is used to bring the expression measures to a more bell-shaped distribution and to make the variance across expression values more similar.
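
To illustrate the variance point, here is a small sketch under a strong assumption: that the fluctuation is purely multiplicative (each gene swings by the same two-fold factor). The numbers are invented; real arrays also carry additive background noise, which is why low-intensity probes stay noisier even on the log scale.

```python
import math
from statistics import pvariance

# Two hypothetical genes with the same two-fold fluctuation
# at very different expression levels (made-up values).
low = [10.0, 20.0, 10.0, 20.0]
high = [1000.0, 2000.0, 1000.0, 2000.0]

# On the raw scale the variance grows with the expression level.
raw_var_low = pvariance(low)
raw_var_high = pvariance(high)

# On the log2 scale both genes vary by exactly one unit (one doubling).
log_var_low = pvariance([math.log2(v) for v in low])
log_var_high = pvariance([math.log2(v) for v in high])

print(raw_var_low, raw_var_high)
print(log_var_low, log_var_high)
```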

I would suggest getting the .CEL files and normalizing with RMA or frozen RMA. I would not manipulate the output in an ad hoc manner without good evidence to do so; there is over a decade of experience with Affy microarrays that you would potentially be invalidating by doing ad hoc stuff.

Finally, since you are interested in correlations, you can use variance filters on the features to remove features that show little or no variance since these are unlikely to show strong correlations. This will functionally remove the lowest expressed features as well.
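
A minimal sketch of such a variance filter followed by a Pearson correlation, in plain Python. The gene names, the values, and the `min_var` cutoff are all hypothetical; in practice the cutoff would be chosen from the data (for example, a quantile of the per-feature variances).

```python
from statistics import pvariance

def variance_filter(expr, min_var):
    """Keep only features whose variance across samples is at least min_var."""
    return {g: v for g, v in expr.items() if pvariance(v) >= min_var}

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

# Made-up log2 expression values: rows are features, columns are samples.
expr = {
    "geneA": [7.1, 8.3, 9.0, 10.2],
    "geneB": [6.9, 8.0, 9.1, 10.0],
    "geneC": [3.01, 3.00, 3.02, 3.01],  # nearly flat, low-expressed
}

kept = variance_filter(expr, min_var=0.1)  # geneC is dropped
r_ab = pearson(kept["geneA"], kept["geneB"])
print(sorted(kept), r_ab)
```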


Yup, I also remove the low-variance features. I now realize that the boxplot is not very informative in that respect, so maybe I should redo it after removing the low-variance features.

The GDS datasets are supposed to be already background corrected and normalized, and I am using them as they are, so I am not manipulating anything. I made a boxplot of all the samples and it looks obvious that the GDS datasets are well made. I am re-normalizing, though, to align the GDS datasets better (each GDS block has a slightly different median and dispersion). I do not see a scientific fallacy in my approach (and it is used in multi-platform microarray assemblies). Of course, ultimately it all depends on hardline reviewers.
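
For reference, the re-normalization described here can be sketched as a basic quantile normalization. This is a simplified version that breaks ties by position instead of averaging them, written in plain Python with invented values; it is not the exact procedure used on the GDS data.

```python
def quantile_normalize(samples):
    """Give every sample the same distribution: the mean of the sorted columns.

    samples: list of equal-length lists, one list per array.
    Ties are broken by position (a simplification).
    """
    n = len(samples[0])
    sorted_cols = [sorted(s) for s in samples]
    # Reference distribution: mean across samples at each rank.
    rank_means = [sum(col[i] for col in sorted_cols) / len(samples)
                  for i in range(n)]
    normalized = []
    for s in samples:
        order = sorted(range(n), key=lambda i: s[i])  # indices from low to high
        out = [0.0] * n
        for rank, idx in enumerate(order):
            out[idx] = rank_means[rank]
        normalized.append(out)
    return normalized

# Three made-up samples with different medians and spreads.
a = [5.0, 2.0, 3.0, 4.0]
b = [4.0, 1.0, 4.5, 2.0]
c = [3.0, 4.0, 6.0, 8.0]

normalized = quantile_normalize([a, b, c])
print(normalized)
```

After this step every sample has an identical set of values and only the rank of each probe within its sample is retained, so block-level differences in median and dispersion disappear by construction.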

10.0 years ago
Manvendra Singh ★ 2.2k

It is best to process from the .CEL files.

With the affy package this is easy and quick: process the files, keep the signals above a detection threshold, then transform to log scale.

Negative values mean that the signals are less than one; these would be filtered out when you correct the .CEL files.

Use the log2 scale, otherwise there would be much more variance during comparison.
