using R aggregate command/combining affy probes into gene IDs
8.2 years ago
StephanieK ▴ 110

I know variants of this question are asked a lot, but I can't find my specific question. I also know there are arguments about whether or not you should combine probes into genes, but I am absolutely sure that, for my particular analysis, I want to collapse the probes into genes.

The background: I obtained CEL and _full.soft files from GEO. I ran RMA on a set of CEL files, giving me a matrix with one column per sample, one row per probe ID, and the expression value for each sample/probe ID combination in each cell.

I want to obtain the mean (or highest, I imagine this is easy to change with a parameter) expression value per gene (i.e. I want each sample to be represented by a gene's expression value, not a probe's expression value).

I have the probe-to-gene mapping in the _full.soft files.

So now, I have a set of expression values for each probe (in each matrix) and a set of probe to gene mappings (in the _full.soft file) and I want to combine them. I have two questions:

I've tried to use the aggregate function in R:

Test Data Input:

Probe      Gene    Sample1     Sample2       Sample3         Sample4 
Probe1    Gene1    10.0        12.3            12.4               2.0   
Probe2    Gene1    45.0        23.2            12.4              12.4    
Probe3    Gene2    10.0        110.0           1.3                1.4    
Probe4    Gene2     4.5         65.2           34.2              89.3    
Probe5    Gene5     1.2         3.4            6.7              2.3

The code:

table <-read.table("TestData",header=T)
data <-as.data.frame(table,header=T)
aggdata <-aggregate(data,by=list(Gene),FUN=mean,na.rm=TRUE)

The error:

Warning messages:
1: In mean.default(X[[i]], ...) :
  argument is not numeric or logical: returning NA
2: In mean.default(X[[i]], ...) :
  argument is not numeric or logical: returning NA
3: In mean.default(X[[i]], ...) :
  argument is not numeric or logical: returning NA
4: In mean.default(X[[i]], ...) :
  argument is not numeric or logical: returning NA
5: In mean.default(X[[i]], ...) :
  argument is not numeric or logical: returning NA
6: In mean.default(X[[i]], ...) :
  argument is not numeric or logical: returning NA

The output:

> aggdata

  Group.1 Probe Gene Sample1 Sample2 Sample3 Sample4
1   Gene1    NA   NA   27.50   17.75   12.40    7.20  
2   Gene2    NA   NA    7.25   87.60   17.75   45.35   
3   Gene5    NA   NA    1.20    3.40    6.70    2.30

Question 1: As you can see, the numbers are right, but I want to get rid of the warnings by telling aggregate to compute the mean only for the sample columns, not the Probe or Gene columns. I know the answer is to change the command in my script to say "only calculate the mean from the 3rd column onward" (counting from 1, as R does), but I can't figure out how to do it; I keep getting errors.

Question 2: Is this the best function for this purpose (i.e. combining different probes into the same gene)? What other packages do researchers use?

Thanks.

affy aggregate r expression probe • 6.8k views

Hi, see my answer for a solution with the tidyverse. One extra comment: in your code, you don't need the line as.data.frame(table,header=T), because read.table already returns a data.frame (also, I don't think header=T does anything in as.data.frame; most likely it is being ignored).


I finally found why the aggregate function was not working. See my edit at the end of the answer.

8.2 years ago
ddiez ★ 2.0k

I am not familiar with the aggregate function (although maybe I should be). I tried it and got the same warnings as you; I will have to investigate why. In the meantime, this is how you can do it with some packages from the so-called tidyverse, a metapackage that installs several packages that are very handy for data wrangling. Only two of those packages are actually required (they are loaded in the code). I am assuming you want to compute the average of the probes that map to the same gene, per sample:

# example data:
d
   Probe  Gene Sample1 Sample2 Sample3 Sample4
1 Probe1 Gene1    10.0    12.3    12.4     2.0
2 Probe2 Gene1    45.0    23.2    12.4    12.4
3 Probe3 Gene2    10.0   110.0     1.3     1.4
4 Probe4 Gene2     4.5    65.2    34.2    89.3
5 Probe5 Gene5     1.2     3.4     6.7     2.3

# load required packages.
library(tidyr)
library(dplyr)

# transform data to long version:
dd <- d %>% gather(sample, value, -Gene, -Probe)
dd
    Probe  Gene  sample value
1  Probe1 Gene1 Sample1  10.0
2  Probe2 Gene1 Sample1  45.0
3  Probe3 Gene2 Sample1  10.0
4  Probe4 Gene2 Sample1   4.5
5  Probe5 Gene5 Sample1   1.2
6  Probe1 Gene1 Sample2  12.3
7  Probe2 Gene1 Sample2  23.2
8  Probe3 Gene2 Sample2 110.0
9  Probe4 Gene2 Sample2  65.2
10 Probe5 Gene5 Sample2   3.4
11 Probe1 Gene1 Sample3  12.4
12 Probe2 Gene1 Sample3  12.4
13 Probe3 Gene2 Sample3   1.3
14 Probe4 Gene2 Sample3  34.2
15 Probe5 Gene5 Sample3   6.7
16 Probe1 Gene1 Sample4   2.0
17 Probe2 Gene1 Sample4  12.4
18 Probe3 Gene2 Sample4   1.4
19 Probe4 Gene2 Sample4  89.3
20 Probe5 Gene5 Sample4   2.3

# compute average by Gene, Sample (this returns a tibble instead of data.frame):
dm <- dd %>% group_by(Gene, sample) %>% summarize(mean = mean(value))
dm
Source: local data frame [12 x 3]
Groups: Gene [?]

     Gene  sample  mean
   <fctr>   <chr> <dbl>
1   Gene1 Sample1 27.50
2   Gene1 Sample2 17.75
3   Gene1 Sample3 12.40
4   Gene1 Sample4  7.20
5   Gene2 Sample1  7.25
6   Gene2 Sample2 87.60
7   Gene2 Sample3 17.75
8   Gene2 Sample4 45.35
9   Gene5 Sample1  1.20
10  Gene5 Sample2  3.40
11  Gene5 Sample3  6.70
12  Gene5 Sample4  2.30

# convert data back to wide format and, if needed, coerce to data.frame:
dm %>% spread(sample, mean) %>% as.data.frame
   Gene Sample1 Sample2 Sample3 Sample4
1 Gene1   27.50   17.75   12.40    7.20
2 Gene2    7.25   87.60   17.75   45.35
3 Gene5    1.20    3.40    6.70    2.30
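
In tidyr 1.0.0 and later, gather() and spread() still work but are superseded by pivot_longer() and pivot_wider(). The following is a sketch of the same long/summarize/wide pipeline with the newer verbs, assuming tidyr >= 1.0.0 and dplyr >= 1.0.0 are installed. The data frame d is rebuilt here from the example table so the block runs on its own; gene_means is just an illustrative name.

```r
library(tidyr)
library(dplyr)

# example data, copied from the table in the question
d <- data.frame(
  Probe   = c("Probe1", "Probe2", "Probe3", "Probe4", "Probe5"),
  Gene    = c("Gene1", "Gene1", "Gene2", "Gene2", "Gene5"),
  Sample1 = c(10.0, 45.0, 10.0, 4.5, 1.2),
  Sample2 = c(12.3, 23.2, 110.0, 65.2, 3.4),
  Sample3 = c(12.4, 12.4, 1.3, 34.2, 6.7),
  Sample4 = c(2.0, 12.4, 1.4, 89.3, 2.3)
)

gene_means <- d %>%
  # long format: one row per probe/sample pair
  pivot_longer(cols = -c(Probe, Gene),
               names_to = "sample", values_to = "value") %>%
  # average the probes of each gene within each sample
  group_by(Gene, sample) %>%
  summarize(mean = mean(value), .groups = "drop") %>%
  # back to wide format: one row per gene, one column per sample
  pivot_wider(names_from = sample, values_from = mean) %>%
  as.data.frame()

gene_means
```

If the long table is not needed for anything else, `d %>% group_by(Gene) %>% summarize(across(starts_with("Sample"), mean))` should give the same per-gene means without reshaping.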

EDIT

I understand now why your use of aggregate didn't work. The problem is that you are passing the entire data.frame, so it tries to compute the mean of the Probe and Gene columns as well. These columns are not numeric, so mean() returns NA with a warning for each of them. This is how to call aggregate with the example data:

# pass only the numeric data that we want to aggregate:
aggregate(d[, -c(1,2)],
          by = list(Gene = d$Gene),
          FUN = mean,
          na.rm = TRUE)

   Gene Sample1 Sample2 Sample3 Sample4
1 Gene1   27.50   17.75   12.40    7.20
2 Gene2    7.25   87.60   17.75   45.35
3 Gene5    1.20    3.40    6.70    2.30

Note that aggregate joins the aggregating vector to the data.frame (which didn't contain that column any more). This is very convenient.
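
aggregate also has a formula interface, which may read more naturally, and swapping FUN is all it takes to get the highest probe per gene instead of the mean (as asked in the question). A minimal sketch on the example data follows; the names gene_mean and gene_max are illustrative. One caveat to be aware of: with the formula interface the default na.action is na.omit, so a probe row containing any NA is dropped entirely, which is not the same as passing na.rm = TRUE to mean.

```r
# example data, copied from the table in the question
d <- data.frame(
  Probe   = c("Probe1", "Probe2", "Probe3", "Probe4", "Probe5"),
  Gene    = c("Gene1", "Gene1", "Gene2", "Gene2", "Gene5"),
  Sample1 = c(10.0, 45.0, 10.0, 4.5, 1.2),
  Sample2 = c(12.3, 23.2, 110.0, 65.2, 3.4),
  Sample3 = c(12.4, 12.4, 1.3, 34.2, 6.7),
  Sample4 = c(2.0, 12.4, 1.4, 89.3, 2.3)
)

# drop the Probe column so that "." expands to the sample columns only
no_probe <- d[, names(d) != "Probe"]

# mean of all probes per gene
gene_mean <- aggregate(. ~ Gene, data = no_probe, FUN = mean)

# highest probe value per gene, computed independently for each sample
gene_max <- aggregate(. ~ Gene, data = no_probe, FUN = max)

gene_mean
gene_max
```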


Is mean the function that is used most commonly?
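
Mean is one option among several; median, maximum, or keeping a single representative probe per gene are also used, and which is appropriate depends on the analysis. As an illustration of the last approach, here is a base R sketch that keeps, for each gene, the probe with the highest average expression across samples. The column name probe_mean and the object best are made up for this example.

```r
# example data, copied from the table in the question
d <- data.frame(
  Probe   = c("Probe1", "Probe2", "Probe3", "Probe4", "Probe5"),
  Gene    = c("Gene1", "Gene1", "Gene2", "Gene2", "Gene5"),
  Sample1 = c(10.0, 45.0, 10.0, 4.5, 1.2),
  Sample2 = c(12.3, 23.2, 110.0, 65.2, 3.4),
  Sample3 = c(12.4, 12.4, 1.3, 34.2, 6.7),
  Sample4 = c(2.0, 12.4, 1.4, 89.3, 2.3)
)

# average expression of each probe across all samples
sample_cols <- grep("^Sample", names(d))
d$probe_mean <- rowMeans(d[, sample_cols])

# sort by gene, then by decreasing probe average,
# and keep the first (highest) probe of each gene
ord  <- d[order(d$Gene, -d$probe_mean), ]
best <- ord[!duplicated(ord$Gene), names(ord) != "probe_mean"]

best
```

Unlike averaging, this keeps real measured values from one probe per gene, at the cost of ignoring the other probes.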


Such conversions may take time for a big matrix (400 samples x 60,000 probe IDs). I am curious not about the most commonly used approach, but about the one that is memory efficient, fast, and least error-prone.
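
For a large numeric matrix, one option that stays in base R is rowsum(), which sums the rows of a matrix by group in compiled code; dividing by the number of probes per gene then gives the means. This is a sketch on the small example data, not a benchmark: it has not been timed on a 400 x 60,000 matrix here, and the object names (expr, gene, n_probes) are illustrative. It also assumes there are no NA values in the matrix. Among packaged alternatives, limma provides avereps(), which averages rows that share an ID.

```r
# example data, copied from the table in the question
d <- data.frame(
  Probe   = c("Probe1", "Probe2", "Probe3", "Probe4", "Probe5"),
  Gene    = c("Gene1", "Gene1", "Gene2", "Gene2", "Gene5"),
  Sample1 = c(10.0, 45.0, 10.0, 4.5, 1.2),
  Sample2 = c(12.3, 23.2, 110.0, 65.2, 3.4),
  Sample3 = c(12.4, 12.4, 1.3, 34.2, 6.7),
  Sample4 = c(2.0, 12.4, 1.4, 89.3, 2.3)
)

# numeric matrix with probes as rows, plus a parallel vector of gene IDs
expr <- as.matrix(d[, grep("^Sample", names(d))])
rownames(expr) <- d$Probe
gene <- d$Gene

# per-gene sums for every sample in one call
sums <- rowsum(expr, group = gene)

# number of probes per gene, computed the same way so the row order matches
n_probes <- rowsum(rep(1, length(gene)), group = gene)

# per-gene means: each row of sums is divided by that gene's probe count
gene_mean <- sums / as.vector(n_probes)

gene_mean
```

Keeping the expression values in a matrix rather than a data.frame avoids the reshaping step and the per-group function calls, which is where most of the time tends to go with aggregate on wide data.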
