Thanks to both, i can propose something here for other people that might have the same problem :
Suppose you want to analyse 4 datasets A, B, C, and D, which come from 3 platforms hgug4112a, hgu133a and hgu95av2.
Due to preprocessing, the sets of probes on each dataset may be smaller than the total set of probes. this is taken into account.
The script is :
x = hgu133aENSEMBL
x.mapped = mappedkeys(x)
# The line below is to only take into account probes that were not
# filtered during preprocessing. There are 2 datasets from the same platform here.
x.mapped = x.mapped[which(x.mapped%in%row.names(A.data) == TRUE &
x.mapped%in%row.names(B.data) == TRUE )]
y = hgug4112aENSEMBL
y.mapped=mappedkeys(y)
# The line below is to only take into account probes that were not
# filtered during preprocessing. There is only 1 dataset from this platform.
y.mapped = y.mapped[which(y.mapped%in%row.names(C.data) == TRUE)]
# m1 is the the first intersection of ENSEMBL_IDs
m1 = intersect(toTable(x[x.mapped])$ensembl_id,toTable(y[y.mapped])$ensembl_id)
length(m1)
#[1] 10461
#Thihs is for the last (4th) dataset
z = hgu95av2ENSEMBL
z.mapped = mappedkeys(z)
z.mapped = z.mapped[which(z.mapped%in%row.names(D.data) == TRUE)]
# If i'm not wrong, the order in which you cascade the intersections does not matter.
# If we had a 5th dataset, we would do m3 = intersect(m2, toTable( ...
m2 = intersect(m1,toTable(z[z.mapped])$ensembl_id)
length(m2)
# [1] 7476
x.names = unlist(as.list(x[x.mapped]))
x.probes = names(x.names[match(m2,x.names)])
# Match will take only the "first" match, which is a solution for the problem discussed below
length(x.probes)
# [1] 7476
y.names = unlist(as.list(y[y.mapped]))
y.probes = names(y.names[match(m2,y.names)])
length(y.probes)
# [1] 7476
z.names = unlist(as.list(z[z.mapped]))
z.probes = names(z.names[match(m2,z.names)])
length(z.probes)
# [1] 7476
Sorry for that, but there is a mistake in the code above :
DON'T USE THIS SYNTAX : unlist(as.list(x[x.mapped])) => because to avoid duplicate names, it adds a number after the probe name (and toTable don't) :
unlist(as.list(x[x.mapped]))[1:1000]
1007_s_at1 1007_s_at2 1007_s_at3 1007_s_at4
"ENSG00000137332" "ENSG00000204580" "ENSG00000215522" "ENSG00000230456"
1007_s_at5 1053_at 117_at 121_at
"ENSG00000234078" "ENSG00000049541" "ENSG00000173110" "ENSG00000125618"
toTable(x[x.mapped])[1:1000,]
probe_id ensembl_id
1 1007_s_at ENSG00000137332
2 1007_s_at ENSG00000204580
3 1007_s_at ENSG00000215522
4 1007_s_at ENSG00000230456
5 1007_s_at ENSG00000234078
So the right code is :
x.names = toTable(x[x.mapped])
x.probes = x.names$probe_id[match(m2,x.names$ensembl_id)]
length(x.probes)
# [1] 7476
y.names = toTable(y[y.mapped])
y.probes = y.names$probe_id[match(m2,y.names$ensembl_id)]
length(y.probes)
# [1] 7476
z.names = toTable(z[z.mapped])
z.probes = z.names$probe_id[match(m2,z.names$ensembl_id)]
length(z.probes)
# [1] 7476
And then, you can construct a reduced eset with this :
A.eset = new("ExpressionSet",
exprs = as.matrix(A.data[
which(y.probes%in%row.names(A.data)),1:9],
dimnames=list(c(y.probes,row.names(A.pData)))),
phenoData = A.phenoData, experimentData = A.expData,
annotation = "hgug4112a")
B.eset = new("ExpressionSet",
exprs = as.matrix(B.data[
which(x.probes%in%row.names(B.data)),7:15],
dimnames=list(c(x.probes,row.names(B.pData)))),
phenoData = B.phenoData, experimentData = B.expData,
annotation = "hgu133a")
etc ...
So this works fine, but it comes with another problem i can't answer now, how to choose between probes for the same ID. For exemple, in Jeremy Leipzig example, we get a data.frame like that :
merge(toTable(x[x.mapped]),toTable(y[y.mapped]),by="ensembl_id")[1:10,]
ensembl_id probe_id.x probe_id.y
1 ENSG00000000419 202673_at A_23_P68472
2 ENSG00000000457 41329_at A_23_P74320
3 ENSG00000000457 205607_s_at A_23_P74320
4 ENSG00000000460 220840_s_at A_23_P11862
5 ENSG00000000938 208438_s_at A_24_P51517
6 ENSG00000000938 208438_s_at A_23_P103932
7 ENSG00000000938 208438_s_at A_24_P316634
8 ENSG00000001084 202922_at A_23_P352879
9 ENSG00000001084 202922_at A_23_P145114
10 ENSG00000001084 202923_s_at A_23_P352879
11 ENSG00000001084 202923_s_at A_23_P145114
The dimensions of the resulted data.frame is 35006 x 3, which is higher than the original tables' dimensions that were : x => 21363 x 2 and y => 22712 x 2 (because all mappings are given, see lines 5 to 7, and 8 to 11).
I don't know wich slot of the annotation object is the best for the mapping. I choosed ensembl_id but one could have take SYMBOL, or something else. I think CHRLOC is too precise, but SYMBOL might exclude non annotated genes ? Do you know if we could have a slot in the annotation object to mapped probes to an exon ? That might help to match probesets one to each others ?
Thanks again for the help.
Julien