Question

R + Bioconductor : Combining Probesets In An Expressionset

6

15.2 years ago

Mike Dewar ★ 1.6k

Hi,

Here's what I have:

library('GEOquery')
GDS = getGEO('GDS785')
cd4T = GDS2eSet(GDS)
cd4T <- cd4T[fData(cd4T)$Gene.symbol != "", ]

Now cd4T is an ExpressionSet object which wraps a big matrix with 19794 rows (probesets) and 15 columns (samples). The final line gets rid of all probesets that do not have corresponding gene symbols. Now the trouble is that most genes in this set are assigned to more than one probeset. You can see this by doing

gene_symbols = factor(fData(cd4T)$Gene.symbol)
length(gene_symbols)-length(levels(gene_symbols))
[1] 6897

So 6897 of my probesets are redundant: each maps to a gene symbol that another probeset already covers. I'd like to somehow combine the expression levels of the probesets associated with each gene. I don't care much about the actual probe ID for each probe. I'd very much like to end up with an ExpressionSet containing the merged information, as all of my downstream analysis is designed to work with this class.
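To make that count concrete, here is a minimal base R sketch on made-up data. The `symbols` vector is hypothetical and stands in for `fData(cd4T)$Gene.symbol`; nothing here comes from GDS785.

```r
# Hypothetical gene symbols for six probesets (not taken from GDS785)
symbols <- c("CD4", "CD4", "IL2", "FOXP3", "FOXP3", "FOXP3")

# Number of probesets annotated to each gene symbol
per_gene <- table(symbols)

# Probesets beyond the first one for each symbol; this is the same quantity
# as length(gene_symbols) - length(levels(gene_symbols)) in the question
n_redundant <- length(symbols) - length(unique(symbols))

# Number of genes that are represented by more than one probeset
n_multi <- sum(per_gene > 1)
```

Here `n_redundant` counts surplus probesets, while `n_multi` counts genes with more than one probeset, so the two numbers answer different questions.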

I think I can write some code that will do this by hand, and make a new expression set from scratch. However, I'm assuming this can't be a new problem and that code exists to do it, using a statistically sound method to combine the gene expression levels. I'm guessing there's a proper name for this also but my googles aren't showing up much of use. Can anyone help?

r bioconductor probeset microarray • 14k views

ADD COMMENT • link updated 6.9 years ago by Ram 45k • written 15.2 years ago by Mike Dewar ★ 1.6k

1

OK, first question: why do you want to combine the expression levels of multiple probesets into one gene? I have to say that with Affy data I almost exclusively work at the probeset level, and I'd imagine most other people do. There's a lot of information in those probesets, and you might not want to be chucking it away right from the outset.

ADD REPLY • link 15.2 years ago by User 59 13k

1

That's the way I would go about it. The problem is that probesets (especially from a chip like U133A which I think you're analysing) were designed to different builds of the underlying genome. Some probesets match multiple genes/transcripts/splice variants, some are misannotated etc. Best to work out which probesets are differentially expressed, then worry about disambiguating the gene level stuff at the end. Not to say that someone won't provide an answer to your problem however... :)

ADD REPLY • link 15.2 years ago by User 59 13k

0

I guess because this is how my limited understanding works! I'm looking for differentially expressed /genes/ one way or the other. Maybe I should be looking at differentially expressed probesets, then worry about which genes these probesets are associated with at the end of the analysis, rather than at the start? This being the standard approach would explain my failure googling...

ADD REPLY • link 15.2 years ago by Mike Dewar ★ 1.6k

0

Similar questions related to probesets here: please take a look at Ian Simpson's suggestions on dealing with differential expression hits based on different probes of the same gene.

ADD REPLY • link updated 5.8 years ago by Ram 45k • written 15.2 years ago by Khader Shameer 18k

score 8 · Answer 1 · 2010-05-06

So typically this is not done. You would lose a lot of information by doing this. You could take a geometric mean (the probesets in some gene expression data I had showed a log-normal distribution), but that is not the best option.
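For what it's worth, a rough base R sketch of the averaging idea looks like this. It assumes the values are already log2-transformed, in which case the arithmetic mean of the logs is the log of the geometric mean. The matrix `expr` and the vector `symbols` are made-up stand-ins for `exprs(cd4T)` and the gene symbol column.

```r
# Toy log2 expression matrix: 4 probesets x 3 samples (made-up values)
expr <- matrix(c(8, 10, 6, 7,    # sample s1
                 9, 11, 6, 5,    # sample s2
                 7,  9, 6, 9),   # sample s3
               nrow = 4,
               dimnames = list(c("p1", "p2", "p3", "p4"),
                               c("s1", "s2", "s3")))

# Hypothetical gene symbol for each probeset (row) of expr
symbols <- c("GENE_A", "GENE_A", "GENE_B", "GENE_B")

# Sum the rows within each gene, then divide by the number of probesets
sums <- rowsum(expr, group = symbols)
counts <- as.vector(table(symbols)[rownames(sums)])

# Mean of log2 values per gene = log2 of the geometric mean of raw values
gene_means <- sums / counts
```

The result has one row per gene symbol, so the probeset identity is lost, which is exactly the information loss mentioned above.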

Typically you do want to reduce the number of probesets you keep in your analysis, to reduce the number of tests you make (which affects any FDR estimates). You could do this by selecting only one probeset per gene using some measure of dispersion, such as median absolute deviation (MAD) or interquartile range (IQR), and keeping the probeset with the most variability/spread as the representative for that gene (MAD is better IMO). As a side effect, though, you may actually be looking at the probeset which is subject to the most noise.

You may also want to remove probesets where the majority of the component probes map to multiple locations in the genome (probably leading to dodgy and unreliable results), maybe using SCAMPA: http://web.bioinformatics.ic.ac.uk/scampa/section.html?id=5 or probesets which contain g-spots/g-stacks: http://www.biomedcentral.com/1471-2164/9/613/abstract
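The one-probeset-per-gene selection by MAD could be sketched in base R roughly as follows. This is a toy illustration only: `expr` and `symbols` are made-up stand-ins for `exprs(cd4T)` and the gene symbol column, and ties are simply broken by taking the first probeset.

```r
# Toy expression matrix: 4 probesets x 3 samples (made-up values)
expr <- matrix(c(8, 10, 6, 7,    # sample s1
                 9, 14, 6, 5,    # sample s2
                 7,  6, 6, 9),   # sample s3
               nrow = 4,
               dimnames = list(c("p1", "p2", "p3", "p4"),
                               c("s1", "s2", "s3")))

# Hypothetical gene symbol for each probeset (row) of expr
symbols <- c("GENE_A", "GENE_A", "GENE_B", "GENE_B")

# Median absolute deviation of each probeset across samples
spread <- apply(expr, 1, mad)

# For each gene, the row index of its most variable probeset
keep <- sapply(split(seq_along(symbols), symbols),
               function(i) i[which.max(spread[i])])

# Remember which probeset was chosen, then label rows by gene
chosen_probe <- rownames(expr)[keep]
collapsed <- expr[keep, , drop = FALSE]
rownames(collapsed) <- names(keep)
```

Because `keep` is just a vector of row indices, the same index should also work for subsetting the ExpressionSet itself (`cd4T[keep, ]`), which would keep the result in the class the downstream analysis expects. Swapping `mad` for `IQR` in the `apply` call gives the IQR variant.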

But then which part of the gene the probeset maps to (exons or introns) is also important. Probes which map to different exons may show big differences: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1784106/

Some people have also suggested that it is useful to map probes and probesets to transcripts rather than genes: http://www.ncbi.nlm.nih.gov/pubmed/17394657

Hopefully this will give you some ideas about what to do with your probesets and how to reduce their number.

score 2 · Answer 2 · 2010-05-05

So I said something completely different over on SO (http://tinyurl.com/2ebczrs), but reading through the comments, one thing that came to mind is that alternate annotations exist for Affy chips. These end up producing a single gene per probeset (in some cases), and there is some evidence that this is a valuable thing to do: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1283542/

score 1 · Answer 3 · 2012-02-22

1

13.4 years ago

Ekta Jain ▴ 10

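Hello, limma in R can give you a list of differentially expressed genes. limma can also average the expression of multiple probesets that share an identifier (its avereps() function does this), although it does not do so automatically. I do not know how to simply use the probesets with the highest signal intensity.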
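On keeping only the probeset with the highest signal intensity per gene, one possible base R sketch is below. It is a toy example under stated assumptions: "highest signal" is taken to mean the highest mean across samples, and `expr` and `symbols` are made-up stand-ins for the expression matrix and the gene symbol column.

```r
# Toy expression matrix: 4 probesets x 3 samples (made-up values)
expr <- matrix(c(12, 8, 5, 9,    # sample s1
                 12, 9, 6, 9,    # sample s2
                 12, 7, 7, 9),   # sample s3
               nrow = 4,
               dimnames = list(c("p1", "p2", "p3", "p4"),
                               c("s1", "s2", "s3")))

# Hypothetical gene symbol for each probeset (row) of expr
symbols <- c("GENE_A", "GENE_A", "GENE_B", "GENE_B")

# Average signal of each probeset across all samples
avg_signal <- rowMeans(expr)

# For each gene, the row index of the probeset with the highest average
keep <- sapply(split(seq_along(symbols), symbols),
               function(i) i[which.max(avg_signal[i])])

# One row per gene, still labelled by the probeset that was kept
brightest <- expr[keep, , drop = FALSE]
```

The rows of `brightest` keep their probeset names, so it stays clear which probeset represents each gene.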

Hope this helps.

Ekta

ADD COMMENT • link 13.4 years ago by Ekta Jain ▴ 10