Hello, I always believed that the expression matrices on GEO are normalized. However, today I got huge log2FC values from GSE85957.
> head(expr_3)
# A tibble: 6 x 8
SYMBOL logFC AveExpr t P.Value adj.P.Val B ENTREZID
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
1 Spp1 3419. 2421. 10.2 0.00000368 0.00128 0.221 25353
2 Gstp1 2125. 2338. 10.3 0.00000328 0.00128 0.243 24426
3 Cyp2e1 2047. 2833. 4.32 0.00204 0.0235 -1.89 25086
Here is how I extracted my expression data:
library(GEOquery)  # provides getGEO(); attaches Biobase, which provides exprs()
gse_path <- "/datapool/pengguoyu/Microarray/20190711_geo/rawdata/GSE85957_series_matrix.txt.gz"
gse <- getGEO(filename=gse_path, AnnotGPL=TRUE)
expr <- exprs(gse)
So I went back to check the expression matrix:
PROBEID GSM2288460 GSM2288461 GSM2288462 GSM2288463 GSM2288464
1367452_at 1165.0328 1011.4838 1193.8429 1143.6874 1162.2721
1367453_at 512.07166 519.57355 502.8087 433.26254 480.2318
1367454_at 647.18243 619.50635 673.89526 644.89575 685.5907
1367455_at 1226.1555 1299.9249 1318.0239 1363.5055 1308.6063
1367456_at 1530.6841 1611.0748 1768.4469 1761.0474 1751.5911
1367457_at 426.08826 282.9359 433.74475 421.27148 445.81595
This seems to be data that has not been normalized. How can I tell whether an expression matrix from GEO is normalized, and whether I can apply the lmFit function from limma directly? Thanks.
You posted while I was writing my own answer, but, yes, the samples are normalised.
Here was my answer:
---------------------------------
The answer is that you can never be sure. GEO even states on its website (somewhere) that it cannot guarantee that each dataset is normalised. This is partly why data curation can be so problematic and time-consuming.
I have looked at your dataset, though, and the data is normalised; however, the normalisation method that was used was MAS 5.0, which is not as common as RMA normalisation. If you look at an individual sample record, you will see this:
[source: https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM2288450]
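So, when you download the data, I think that you should log2 transform it. MAS 5.0 normalisation does not involve any log2 transformation (unlike RMA). That is also why your logFC values are so large: limma computes them as differences between group means of whatever values it is given, so on linear-scale data they are differences in raw intensity, not log2 fold changes.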
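As a rough sketch of that step, something like the following could be used. The toy matrix just reuses a few values from the table in the question as a stand-in for exprs(gse), and the cutoff of 50 and the floor at 1 are my own assumptions, not an official rule from GEO or limma:

```r
# Toy matrix standing in for exprs(gse); values copied from the question
expr <- matrix(
  c(1165.0328, 512.07166, 647.18243,
    1011.4838, 519.57355, 619.50635),
  nrow = 3,
  dimnames = list(c("1367452_at", "1367453_at", "1367454_at"),
                  c("GSM2288460", "GSM2288461"))
)

# Heuristic (an assumption): log2-scale microarray values rarely exceed ~20,
# so a much larger maximum suggests the data are still on the linear scale
looks_linear <- max(expr, na.rm = TRUE) > 50

if (looks_linear) {
  # Floor at 1 to guard against zero or negative values before taking logs
  expr_log2 <- log2(pmax(expr, 1))
} else {
  expr_log2 <- expr
}

range(expr_log2)
```

The transformed matrix (expr_log2 here, a name I made up) is what you would then pass to lmFit, so that logFC really is a log2 fold change.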
If you plot a histogram of your pre- and post-transformed data, you will instantly see the effect of log2 transformation:
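A minimal way to draw such a before/after comparison is sketched here. It uses simulated log-normal intensities rather than the real GSE85957 values, so the numbers are purely illustrative; with your own data you would replace raw with as.vector(exprs(gse)):

```r
set.seed(1)

# Simulated linear-scale intensities (log-normal), a stand-in for real data
raw <- rlnorm(5000, meanlog = 6, sdlog = 1.2)
logged <- log2(raw)

# Two panels side by side: linear scale on the left, log2 scale on the right
op <- par(mfrow = c(1, 2))
hist(raw, breaks = 50, main = "Before log2", xlab = "Intensity")
hist(logged, breaks = 50, main = "After log2", xlab = "log2(intensity)")
par(op)  # restore the previous plotting layout
```

The left panel should be strongly right-skewed, with most values piled up near zero, while the right panel should look roughly bell-shaped.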
So, in summary:
Kevin