I've been playing around with the EBI Gene Expression Atlas (GXA). It has an API. So, for example, I can retrieve data about the human gene SRI in JSON format using this URI:
http://www.ebi.ac.uk:80/gxa/api/vx?geneIs=ENSG00000075142&format=json
I wrote some R code to fetch/parse the JSON into a data frame:
library(RCurl)
library(rjson)
library(plyr)
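# Flatten the parsed JSON into one row per (factor value, experiment)
j2df <- function(l) {
  e <- lapply(l$results[[1]]$expressions, function(x) {
    ef   <- x$ef    # experimental factor, e.g. cell_line
    efv  <- x$efv   # experimental factor value, e.g. 22Rv1
    # one entry per experiment reported for this factor value
    updn <- sapply(x$experiments, function(y) y$updn)
    pval <- sapply(x$experiments, function(y) y$pvalue)
    accn <- sapply(x$experiments, function(y) y$experimentAccession)
    list(ef = ef, efv = efv, accn = accn, updn = updn, pvalue = pval)
  })
  # bind the per-factor-value lists into a single data frame
  e <- ldply(e, as.data.frame)
  return(e)
}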
# fetch the JSON
j <- fromJSON(getURL("http://www.ebi.ac.uk:80/gxa/api/vx?geneIs=ENSG00000075142&format=json"))
# convert to data frame
sri <- j2df(j)
When I examine the first few rows, I see:
head(sri)
         ef   efv      accn updn pvalue
1 cell_line   1A2 E-MTAB-37 DOWN  0.000
2 cell_line 22Rv1 E-MTAB-37   UP  0.003
3 cell_line 22Rv1 E-MTAB-37 DOWN  0.019
4 cell_line  5637 E-MTAB-37   UP  0.000
5 cell_line  647V E-MTAB-37   UP  0.009
6 cell_line  769P E-MTAB-37   UP  0.000
According to rows 2 and 3, the same gene (SRI) in the same experiment (E-MTAB-37) is both up-regulated (p = 0.003) and down-regulated (p = 0.019) in cell line 22Rv1, as compared with mean expression from all cell lines. At least, that is my understanding of UP and DOWN as defined in the GXA documentation.
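To see how widespread this is, something like the following should list every factor value that gets both calls within the same experiment. This is only a sketch: find_conflicts is a name I made up, it assumes that only the labels UP and DOWN count as opposing calls, and the small data frame just re-types the six rows shown above so the snippet runs on its own. With real data, pass sri instead.

```r
# Toy input: the six rows from head(sri) above, typed in by hand.
sri_head <- data.frame(
  ef     = rep("cell_line", 6),
  efv    = c("1A2", "22Rv1", "22Rv1", "5637", "647V", "769P"),
  accn   = rep("E-MTAB-37", 6),
  updn   = c("DOWN", "UP", "DOWN", "UP", "UP", "UP"),
  pvalue = c(0.000, 0.003, 0.019, 0.000, 0.009, 0.000),
  stringsAsFactors = FALSE
)

# Hypothetical helper (not part of any package): keep the rows whose
# (ef, efv, accn) combination carries both an UP and a DOWN call.
find_conflicts <- function(df) {
  key <- paste(df$ef, df$efv, df$accn, sep = "|")
  # number of distinct UP/DOWN directions seen for each combination
  n_dirs <- tapply(df$updn, key, function(x) {
    length(unique(x[x %in% c("UP", "DOWN")]))
  })
  df[key %in% names(n_dirs)[n_dirs > 1], , drop = FALSE]
}

conflicts <- find_conflicts(sri_head)
print(conflicts)
```

On the six rows above this keeps only the two 22Rv1 rows; running find_conflicts(sri) on the full data frame would show whether other cell lines are affected.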
Am I missing something obvious? Or are the data returned by the GXA API simply nonsense?
Using a custom CDF such as those from Brainarray (http://brainarray.mbni.med.umich.edu/Brainarray/Database/CustomCDF/genomiccuratedCDF.asp), in which all probes targeting the same gene are combined, should prevent this problem. Of course, the different probes could also target different transcripts of the same gene, which would give a biological explanation for what you found.
Randomly clicking around and selecting various cell lines, one can find other similar examples where the designations don't match: D341Med, Detroit562, H4, HPAFII. Yet the calls are highly significant (D341Med has p-values on the order of E-7 and E-10 for the opposing behaviors). In many other cases one of the p-values is ridiculously low (1E-10), whereas the other is undefined.
In a way, this demonstrates the utility (or lack thereof) of p-values.