I need to compute gene co-expression for a compendium of GEO microarrays. I downloaded a number of GDS datasets corresponding to two GPL platforms and merged them into a single gene expression table. After log2 transforming them I obtained a lot of negative values. The negative values come from small expression values (between 0 and 1), which the log2 transformation turns into outliers. These outliers influence every type of co-expression measurement. The GDS datasets are supposed to be both background corrected and normalized, but I performed quantile normalization anyway to align the probe distributions across datasets. I still have too many negative values, though.
How would you recommend I proceed?
- Download the raw .CEL files and perform a uniform background correction/normalization? I have seen people say that this improves the overall quality, but I am not convinced. First, these operations are mostly performed to eliminate consistent noise due to specific experimental conditions. Second, negative values are already present in the GDS datasets after all the statistical proofing, so what guarantees that I will not end up in the same situation, especially since I will use many different experiments?
- Add 1.0 to all expression values before log2 transforming them. This is my favored solution.
- Not using any log2 transformation (why is it used anyway?). However, this would make the outliers even stronger.
- ???
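To make the second option concrete, here is a minimal sketch of the "add 1.0 before log2" idea using NumPy. The function name `log2_pseudocount` and the toy matrix are made up for illustration, and the sketch assumes the merged table holds non-negative, untransformed intensities (genes in rows, samples in columns).

```python
import numpy as np

def log2_pseudocount(expr, pseudocount=1.0):
    """Return log2(expr + pseudocount) for a genes x samples matrix.

    Hypothetical helper: assumes expr holds non-negative, untransformed
    intensities, so expr + 1 >= 1 and the result is never negative.
    """
    expr = np.asarray(expr, dtype=float)
    return np.log2(expr + pseudocount)

# Toy intensities, including values between 0 and 1
raw = np.array([[0.0, 0.5, 1.0],
                [3.0, 7.0, 1023.0]])

# Plain log2: values in (0, 1) become negative, zeros are masked as NaN here
plain = np.log2(np.where(raw > 0, raw, np.nan))

# Shifted log2: zero maps to 0 and small values stay close to 0
shifted = log2_pseudocount(raw)
```

With the shift, values between 0 and 1 land between 0 and 1 on the log scale instead of spreading out toward minus infinity, while large intensities are almost unchanged.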
Yup, I also remove the low-variance features. I now realize that the boxplot is not very informative in that respect, so maybe I should redo it after the low-variance feature cleaning.
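For reference, the low-variance cleaning could be sketched roughly like this. The helper name `drop_low_variance` and the `keep_fraction` cutoff are illustrative assumptions, not the exact filter I use; the matrix is assumed to be genes in rows and samples in columns.

```python
import numpy as np

def drop_low_variance(expr, keep_fraction=0.5):
    """Keep the most variable rows (genes) of a genes x samples matrix.

    Hypothetical helper: keep_fraction is an arbitrary illustrative cutoff.
    Returns the filtered matrix and the boolean mask of kept rows.
    """
    expr = np.asarray(expr, dtype=float)
    variances = expr.var(axis=1, ddof=1)               # per-gene sample variance
    cutoff = np.quantile(variances, 1.0 - keep_fraction)
    mask = variances >= cutoff
    return expr[mask], mask

# Toy matrix: rows 0 and 2 barely vary across samples
toy = np.array([[1.0, 1.0, 1.0, 1.0],
                [0.0, 10.0, 0.0, 10.0],
                [2.0, 3.0, 2.0, 3.0],
                [5.0, 50.0, 5.0, 50.0]])

filtered, kept = drop_low_variance(toy, keep_fraction=0.5)
```

The boxplot would then be redrawn on `filtered` rather than on the full table.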
The GDS datasets are supposed to be already background corrected and normalized, and I am using them as they are, so I am not manipulating anything. I made a boxplot of all the samples and it looks obvious that the GDS datasets are well made. I am re-normalizing, though, to align the GDS datasets better (each GDS block has a slightly different median and dispersion). I do not see a scientific fallacy in my approach (and it is used in multi-platform microarray assemblies). Of course, ultimately it all depends on hardline reviewers.
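The re-normalization step is essentially the textbook quantile normalization, which can be sketched as follows. This is a simplified illustration and not the exact code I ran: it assumes a complete genes x samples matrix with no missing values, and ties within a sample are broken arbitrarily rather than averaged.

```python
import numpy as np

def quantile_normalize(expr):
    """Force every column (sample) to share the same value distribution.

    Simplified sketch: assumes no NaNs; tied values within a column get
    different ranks depending on sort order instead of a shared average.
    """
    expr = np.asarray(expr, dtype=float)
    order = np.argsort(expr, axis=0)                    # per-sample sort order
    sorted_vals = np.take_along_axis(expr, order, axis=0)
    reference = sorted_vals.mean(axis=1)                # mean value at each rank
    ranks = np.argsort(order, axis=0)                   # rank of each gene per sample
    return reference[ranks]                             # map ranks to reference values

# Toy matrix: 4 genes x 3 samples with different medians and spreads
toy_expr = np.array([[5.0, 4.0, 3.0],
                     [2.0, 1.0, 4.0],
                     [3.0, 4.0, 6.0],
                     [4.0, 2.0, 8.0]])

normalized = quantile_normalize(toy_expr)
```

After this step every sample has the same set of values, so the per-GDS differences in median and dispersion disappear, while the rank of each gene within its sample is preserved.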