I know variants of this question are asked a lot, but I can't find my specific question. Also, I know there are arguments about whether or not you should combine probes into genes, but I am absolutely sure that for my particular analysis I want to collapse the probes into genes.
The background: I obtained CEL and _full.soft files from GEO. I ran RMA on a set of CEL files, giving me a matrix with sample names as columns, probe IDs as rows, and the expression value for each sample/probe ID pair in each cell.
I want to obtain the mean (or highest, I imagine this is easy to change with a parameter) expression value per gene (i.e. I want each sample to be represented by a gene's expression value, not a probe's expression value).
I have the probe-to-gene mapping in the _full.soft files.
So now, I have a set of expression values for each probe (in each matrix) and a set of probe to gene mappings (in the _full.soft file) and I want to combine them. I have two questions:
I've tried to use the aggregate function in R:
Test Data Input:
Probe Gene Sample1 Sample2 Sample3 Sample4
Probe1 Gene1 10.0 12.3 12.4 2.0
Probe2 Gene1 45.0 23.2 12.4 12.4
Probe3 Gene2 10.0 110.0 1.3 1.4
Probe4 Gene2 4.5 65.2 34.2 89.3
Probe5 Gene5 1.2 3.4 6.7 2.3
The code:
table <-read.table("TestData",header=T)
data <-as.data.frame(table,header=T)
aggdata <-aggregate(data,by=list(Gene),FUN=mean,na.rm=TRUE)
The error:
Warning messages:
1: In mean.default(X[[i]], ...) :
argument is not numeric or logical: returning NA
2: In mean.default(X[[i]], ...) :
argument is not numeric or logical: returning NA
3: In mean.default(X[[i]], ...) :
argument is not numeric or logical: returning NA
4: In mean.default(X[[i]], ...) :
argument is not numeric or logical: returning NA
5: In mean.default(X[[i]], ...) :
argument is not numeric or logical: returning NA
6: In mean.default(X[[i]], ...) :
argument is not numeric or logical: returning NA
The output:
> aggdata
Group.1 Probe Gene Sample1 Sample2 Sample3 Sample4
1 Gene1 NA NA 27.50 17.75 12.40 7.20
2 Gene2 NA NA 7.25 87.60 17.75 45.35
3 Gene5 NA NA 1.20 3.40 6.70 2.30
Question 1: As you can see, the numbers are right, but I want to get rid of the warnings by telling R that I only want to calculate the mean for the sample columns, not the Probe or Gene columns. I know that the answer is to change the command in my script to say "only calculate the mean from the 3rd column on" (R counts columns from 1, so Sample1 is column 3), but I can't figure out how to do it; I keep getting errors.
Question 2: Is this the best function for this purpose (i.e. to combine different probes into the same gene)? What other packages do researchers use?
Thanks.
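For Question 1, a minimal base R sketch of what "only aggregate the sample columns" could look like. It assumes the test data above (recreated inline here rather than read from the TestData file) and that the sample columns are everything from column 3 on. The names agg_mean and agg_max are just illustrative.

```r
# Recreate the test data from the post (values copied from above).
data <- read.table(header = TRUE, text = "
Probe Gene Sample1 Sample2 Sample3 Sample4
Probe1 Gene1 10.0 12.3 12.4 2.0
Probe2 Gene1 45.0 23.2 12.4 12.4
Probe3 Gene2 10.0 110.0 1.3 1.4
Probe4 Gene2 4.5 65.2 34.2 89.3
Probe5 Gene5 1.2 3.4 6.7 2.3
")

# Pass only the numeric sample columns (column 3 onward) to aggregate(),
# and group by the Gene column. Probe and Gene are never handed to mean(),
# so the "argument is not numeric or logical" warnings do not occur.
agg_mean <- aggregate(data[, 3:ncol(data)],
                      by = list(Gene = data$Gene),
                      FUN = mean, na.rm = TRUE)

# Same idea with the formula interface, taking the highest probe per gene.
# data[, -1] drops the Probe column; "." means "every column except Gene".
agg_max <- aggregate(. ~ Gene, data = data[, -1], FUN = max)

print(agg_mean)
print(agg_max)
```

Swapping FUN between mean and max is the only change needed to switch between the mean and the highest value per gene. Note that the formula interface drops rows containing NA by default.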
Hi, see my answer for a solution with the tidyverse. One extra comment: in your code, you don't need the line as.data.frame(table,header=T), as read.table already reads the data as a data.frame (also, I don't think header=T is doing anything in as.data.frame - most likely it is being ignored).
I finally found why the aggregate function was not working. See my edit at the end of the answer.
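For readers who want to see the general shape of a tidyverse approach, here is a rough sketch. It is not necessarily the code from the answer referred to above. It assumes the dplyr package (version 1.0 or later, for across()) is installed and that all sample columns have names starting with "Sample", as in the test data. The name gene_means is illustrative.

```r
library(dplyr)

# Recreate the test data from the question (values copied from the post).
data <- read.table(header = TRUE, text = "
Probe Gene Sample1 Sample2 Sample3 Sample4
Probe1 Gene1 10.0 12.3 12.4 2.0
Probe2 Gene1 45.0 23.2 12.4 12.4
Probe3 Gene2 10.0 110.0 1.3 1.4
Probe4 Gene2 4.5 65.2 34.2 89.3
Probe5 Gene5 1.2 3.4 6.7 2.3
")

# Group the probes by gene, then average every sample column within
# each gene. Columns not selected by across() (here, Probe) are dropped.
gene_means <- data %>%
  group_by(Gene) %>%
  summarise(across(starts_with("Sample"), ~ mean(.x, na.rm = TRUE)),
            .groups = "drop")

print(gene_means)
```

Replacing mean with max inside across() would give the highest probe value per gene instead.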