Hi all.
I have 10 gene expression data sets (i.e. gene expression data:.CEL files and phenotype data: age of sample; either as a category "old" or "young", or the numerical age of the sample) that I downloaded from GEO. The 10 data sets are different organisms (i.e. mouse, rat and human) and are all Affymetrix single-channel (although different platforms, e.g. GPL85, GPL96). The aim is to do a differential expression analysis of the genes with age per data set (and then I will do a vote-counting method to look at generally which genes are differentially expressed overall between the data sets). I have been looking at using the limma package for this (see here).
My question: I think my data is slightly different, and I'm not sure how to input my data into the algorithm, and I can't seem to find an example that has my particular type of data.
First, for each data set, I used RMA on each data set separately:
library(affy)
data <-ReadAffy()
eset <-rma(data)
write.exprs(eset,file='Dataset1.output')
Then, because I wanted to do a gene expression, and not probe expression analysis (and yes, I know the advantages of doing probe-based, I definitely want gene-based), I added in a "Gene" column that I obtained from the "_full.soft" files from GEO and aggregated the data per gene:
options(max.print=1000000000)
data <-read.table('Dataset1.WithGeneAddedIn',header=T)
sink('Dataset1.Gene.Output')
write.table(aggregate(data[, -c(1,2)],by = list(Gene = data$Gene),FUN = mean,na.rm = TRUE),sep=' ',row.names=FALSE,quote=FALSE)
sink()
So now, I have a standard text file with gene expression data (file called GeneExpression.txt) with the expression levels per gene, instead of per probe, e.g:
Gene Sample1 Sample2 Sample3 Sample4 Sample5 Sample6 Sample7 Sample8
Gene1 8.06 7.84 7.62 7.73 7.85 7.94 7.82 7.84
Gene2 9.00 9.06 9.02 9.04 9.05 9.05 9.09 9.16
Gene3 4.8 4.81 4.98 4.78 4.79 5.02 4.91 4.99
Gene4 7.00 7.18 7.14 7.0 7.07 7.12 7.16 7.07
Gene5 6.46 6.40 6.34 6.49 6.46 6.46 6.45 6.35
And I have a phenotype file like this (File called PhenotypeFile.txt):
SampleID Age
Sample1 14
Sample2 14
Sample3 14
Sample4 14
Sample5 2
Sample6 2
Sample7 2
Sample8 2
So I want to know, for each gene, is there a significant difference in gene expression between the set of young samples (i.e. those individuals that are 2 months old compared to 14 months old in this example, in other data sets, I want to know is there a difference between a range of ages).
The problem: From the workflow in bioconductor, it seems to read in .CEL files and then will give me back differentially expressed probes? Because I want gene-level, and not probe-level analysis, I can't read it in the .CEL files (As I have combined the probes in the .CEL file to genes). So I don't understand how to read in the gene expression data to limma in a "plain text" format the way I have here, where I have already processed it with RMA and converted to gene data myself.Similarly, I want to read in the phenotype for each sample as a plain text file to limma. Then tell limma "This is the gene expression data and samples, this is the phenotype data and samples, this is the categories of samples that I want to compare, is there are genes differentially expressed?"
Would someone know of sample code they could show me, where I could read in my own gene expression data from a text file and not a .CEL file, and phenotypic data from a text file, and then perform a differential expression analysis (obviously don't worry about the parameters of the differential expression analysis e.g. what model to use, if I could just get the skeleton of the code I should use to perform a differential expression analysis using this type of data, I'll be able to understand and see what direction I need to read more in etc myself).
I have found loads of general limma tutorials, but just not specifically how I would do this, or if this is possible.
Thank you for the reply, it is appreciated. I will definitely look into this, but first I just want to try and get an example working, that then I can tinker the details. I attempted to combine the information here and here here to come up with two solutions, but I am running into two problems.
Method 1: I have a phenotype file like this (file called "PhenotypicData"):
and a set of .CEL.gz files in the current working directory
I was attempting to run a script:
When I run this script, I get: There were 48 warnings (use warnings() to see them)
When I add in an extra line to the script:
I get the output: Error: the following are not valid files: /path/to/directory/data_list
When I edit the script:
...and various combinations of the justRMA (e.g. changing "cel_files" to "cel_files/*gz" line to try to tell the script where the .gz files are), I just keep getting:
Error: the following are not valid files: /path_to_files/cel_files/*.gz Execution halted
Method 2: I then tried an approach as in the second link described above.
Example of my gene-based expression data ("GeneExpressionData"):
Example of my feature data ("FeatureData"):
My phenotype data ("PhenotypeData"):
I ran this code:
The error is:
I understand that the error is because in my GeneExpression data has data PER GENE, whereas my feature data are data PER PROBE. I just don't know how to fix this error. I also tried to run this analysis without providing feature data (i.e. probe to gene), but I got an error.
Would you know what's going on here to make the above code run? Many thanks again for your time.