Hi,
Here's what I have:
library('GEOquery')
GDS = getGEO('GDS785')
cd4T = GDS2eSet(GDS)
cd4T <- cd4T[fData(cd4T)$Gene.symbol != "", ]
Now cd4T is an ExpressionSet object which wraps a big matrix with 19794 rows (probesets) and 15 columns (samples). The final line gets rid of all probesets that do not have corresponding gene symbols. Now the trouble is that most genes in this set are assigned to more than one probeset. You can see this by doing
gene_symbols = factor(fData(cd4T)$Gene.symbol)
length(gene_symbols)-length(levels(gene_symbols))
[1] 6897
So 6897 of my 19794 probesets are redundant: they map to a gene symbol that another probeset already covers, which leaves 12897 distinct gene symbols. I'd like to somehow combine the expression levels of each probeset associated with each gene. I don't care much about the actual probe id for each probe. I'd like very much to end up with an ExpressionSet containing the merged information, as all of my downstream analysis is designed to work with this class.
I think I can write some code that will do this by hand, and make a new expression set from scratch. However, I'm assuming this can't be a new problem and that code exists to do it, using a statistically sound method to combine the gene expression levels. I'm guessing there's a proper name for this also but my googles aren't showing up much of use. Can anyone help?
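For reference, the "by hand" merge described above can be sketched in base R by averaging all probesets that share a gene symbol. This is only one of several possible summaries (the mean is an assumption here, not a claim about the statistically best choice), and the toy matrix and symbols below are invented stand-ins for exprs(cd4T) and fData(cd4T)$Gene.symbol.

```r
# Toy stand-in for exprs(cd4T): 5 probesets x 3 samples (values are made up).
expr <- matrix(
  c( 1,  2,  3,
     3,  4,  5,
    10, 10, 10,
     2,  2,  2,
     4,  6,  8),
  nrow = 5, byrow = TRUE,
  dimnames = list(paste0("probe", 1:5), paste0("sample", 1:3))
)

# Toy stand-in for fData(cd4T)$Gene.symbol: one symbol per probeset.
symbols <- c("GENE_A", "GENE_A", "GENE_B", "GENE_C", "GENE_C")

# Sum the rows that share a symbol, then divide each gene's row by the
# number of probesets behind it to get the per-gene mean.
sums      <- rowsum(expr, group = symbols)
counts    <- as.vector(table(symbols)[rownames(sums)])
gene_expr <- sums / counts

print(gene_expr)

# On the real data (untested against GDS785, so treat as an assumption):
#   gene_expr <- limma::avereps(exprs(cd4T), ID = fData(cd4T)$Gene.symbol)
#   merged    <- Biobase::ExpressionSet(assayData = gene_expr,
#                                       phenoData = phenoData(cd4T))
```

The result has one row per gene symbol and the same sample columns, so it can be wrapped back into an ExpressionSet as in the commented lines. Averaging treats every probeset as equally reliable, which is exactly the assumption the replies below question.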
OK, first question: why do you want to combine the expression levels of multiple probesets into one gene? I have to say with Affy data I almost exclusively work at the probeset level, and I'd imagine most other people do. There's a lot of information in those probesets, and you might not want to be chucking it away right from the outset.
That's the way I would go about it. The problem is that probesets (especially from a chip like U133A which I think you're analysing) were designed to different builds of the underlying genome. Some probesets match multiple genes/transcripts/splice variants, some are misannotated etc. Best to work out which probesets are differentially expressed, then worry about disambiguating the gene level stuff at the end. Not to say that someone won't provide an answer to your problem however... :)
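A rough sketch of that order of operations, in base R: test at the probeset level first, and only bring gene symbols in at the end. The results table, the column names and the 0.05 cut-off are all hypothetical; in practice the p-values would come from whatever differential expression test is run on the probesets, ideally after multiple-testing correction.

```r
# Hypothetical per-probeset results from a differential expression test
# run on every probeset (all values here are invented).
results <- data.frame(
  probe   = c("probe1", "probe2", "probe3", "probe4", "probe5"),
  symbol  = c("GENE_A", "GENE_A", "GENE_B", "GENE_C", "GENE_C"),
  p_value = c(0.001, 0.200, 0.030, 0.500, 0.004),
  stringsAsFactors = FALSE
)

# Step 1: call hits at the probeset level (0.05 is an arbitrary cut-off).
hits <- results[results$p_value < 0.05, ]

# Step 2: only now think about genes. Order by p-value and keep the
# best-supported probeset for each symbol.
hits      <- hits[order(hits$p_value), ]
gene_hits <- hits[!duplicated(hits$symbol), ]

# Record how many probesets each gene had in total, so genes where only
# some probesets were called can be checked by hand.
gene_hits$n_probesets <- as.vector(table(results$symbol)[gene_hits$symbol])

print(gene_hits)
```

Here GENE_A and GENE_C each have two probesets but only one was called, which is the kind of disagreement worth inspecting before reporting a gene-level result.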
I guess because this is how my limited understanding works! I'm looking for differentially expressed /genes/ one way or the other. Maybe I should be looking at differentially expressed probesets, then worry about which genes these probesets are associated with at the end of the analysis, rather than at the start? This being the standard approach would explain my failure googling...
There are similar questions related to probesets here. Please take a look at Iam simpson's suggestions on dealing with differential expression hits based on different probes of the same gene.