Question

Select Probeset With Median Expression From Redundant Probesets - Affy

11.7 years ago

Bade ▴ 40

Hello All,

I need your help. I am analyzing Affy data. If a gene is represented by multiple probe sets, then I need to select the probe set with the highest median expression across all samples to represent the expression of that gene. This way I want to remove the redundancy by keeping the analysis at the single probe set level. I came across the 'findLargest' function of the 'genefilter' package, but it is not well documented and I do not know how to use it. At this point I have: esetRMA <- rma(mydata)

Could anybody guide me on how to select a single probe set with the highest median expression across samples? Is there any other way to achieve this, i.e. other than using 'genefilter'? Isn't this already implemented somewhere, i.e. a summarization that collapses probe set level data to gene level?

Genefilter package http://www.bioconductor.org/packages/2.11/bioc/html/genefilter.html

Thanks

AK

r affymetrix filtering probeset • 3.8k views

score 3 · Accepted Answer · 2013-04-03

EDIT: question was unclear and I misunderstood the first time around. Here is the second attempt.

So - what you want is the "maximum of medians", across samples. Here's a toy data frame - rows = probesets, s1 - s3 = samples:

df1 <- data.frame(s1 = c(1:4), s2 = c(2:5), s3 = c(5:8), gene = c("g1", "g1", "g2", "g2"))
df1

  s1 s2 s3 gene
1  1  2  5   g1
2  2  3  6   g1
3  3  4  7   g2
4  4  5  8   g2

You can use apply() to add a column containing the medians for each row:

df1$med <- apply(df1[, 1:3], 1, median)
df1

  s1 s2 s3 gene med
1  1  2  5   g1   2
2  2  3  6   g1   3
3  3  4  7   g2   4
4  4  5  8   g2   5

Use aggregate() to get the "maximum median" per gene:

df2 <- aggregate(med ~ gene, df1, max)
df2  

  gene med
1   g1   3
2   g2   5

Then you can merge the data frames to get the original rows:

merge(df1, df2)

  gene med s1 s2 s3
1   g1   3  2  3  6
2   g2   5  4  5  8

As in my first answer, you'll need an annotation package or file which maps probesets to gene names for your Affy platform.
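For what it's worth, here is a minimal base-R sketch of the same "maximum of medians" idea applied to a matrix rather than a data frame. The names expr, gene, med, keep and expr_gene are made up for illustration. In a real analysis the matrix would presumably come from something like exprs(esetRMA), and the probeset-to-gene vector from your platform's annotation package. Neither is shown here. One difference from the merge() approach: which.max() keeps only the first probeset when two probesets of a gene tie on the median, whereas merge() would return all tied rows.

```r
# Toy expression matrix: rows = probesets, columns = samples (hypothetical data)
expr <- matrix(c(1, 2, 3, 4,
                 2, 3, 4, 5,
                 5, 6, 7, 8),
               nrow = 4,
               dimnames = list(paste0("p", 1:4), paste0("s", 1:3)))

# Hypothetical probeset -> gene mapping, named by probeset ID
gene <- c(p1 = "g1", p2 = "g1", p3 = "g2", p4 = "g2")

# Median expression of each probeset across samples (named by probeset)
med <- apply(expr, 1, median)

# For each gene, pick the probeset with the highest median;
# which.max() returns the first one if there is a tie
keep <- vapply(split(names(med), gene[names(med)]),
               function(p) p[which.max(med[p])],
               character(1))

# One row per gene, still labelled by the selected probeset
expr_gene <- expr[keep, , drop = FALSE]
expr_gene
```

The vector keep is named by gene, so names(keep) can be used to relabel the rows of expr_gene with gene names if that is more convenient downstream.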

FIRST ANSWER

I think you are a little confused. You do not want to select "the probeset with median expression for a gene." There is no such thing. For example, the median of the values 1, 2, 4 and 6 is 3, but none of those values is itself 3.
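A quick check of that arithmetic in R (vals and m are just illustrative names): with an even number of values, median() averages the two middle ones, so the result need not be one of the inputs.

```r
vals <- c(1, 2, 4, 6)

# Even number of values: the median is the mean of the two middle values (2 and 4)
m <- median(vals)
m

# Is the median itself one of the observed values?
m %in% vals
```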

What you want is the median value of all probesets for a gene. One way to do this is to use the aggregate() function in R. Imagine a data frame, df1, that looks something like this:

      s1    s2    s3    gene
p1    V11   V12   V13     g1
p2    V21   V22   V23     g1
p3    V31   V32   V33     g2
p4    V41   V42   V43     g2

Row names (p1, p2...) are probesets. Columns s1 - s3 are samples, containing RMA values. Column gene contains 2 genes (g1, g2), each of which has 2 probesets.

To get the median RMA value per gene in a new data frame:

newdf <- aggregate(. ~ gene, df1, median)

Note that this can be quite slow, even for modest-sized data frames. There is sure to be an equivalent function in one of the Bioconductor packages. You will also need an annotation package or file which maps probesets to gene names for your Affy platform.
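To make the aggregate() call concrete, here is a small runnable sketch on made-up numbers (the values below are placeholders for the V11...V43 entries, not real RMA output). The formula . ~ gene tells aggregate() to summarise every other column, grouped by gene.

```r
# Toy data frame: rows = probesets, s1 - s3 = samples, gene = probeset annotation
df1 <- data.frame(s1 = 1:4, s2 = 2:5, s3 = 5:8,
                  gene = c("g1", "g1", "g2", "g2"),
                  row.names = paste0("p", 1:4))

# Median of each sample column over the probesets of each gene
newdf <- aggregate(. ~ gene, df1, median)
newdf
```

Note that this gives one synthetic row per gene (for example 1.5 for g1 in s1, the median of 1 and 2), which is not the same thing as picking one existing probeset.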