Hi all.
I am doing my first gene expression meta-analysis, and basically, the way that I thought I would build a meta-dataset from individual data sets using GEO is wrong. So I've gone back to the literature, and come up with a new method, and I was wondering what people think.
The aim of the experiment is to look at healthy ageing gene expression changes. Thus, I want all of the data sets from GEO that studied gene expression at different ages (either comparing young to old, or as a time series), and then I want to remove all of the non-wild type/non-healthy/mutants etc, so that I am left with gene expression data for healthy individuals in human, rat and mouse at different time points.
My main issue at this point is how to actually combine the gene expression data sets into one data set, without redundancy, bias or error. I wrote out how I thought I would do this, sent it to GEO, who kindly informed me that I was wrong, and to think again.
So my specific question, at this point, is how to build a meta-data-set from a bunch of different data sets in GEO, and do all necessary filtering/normalising etc to make the expression data from the samples comparable both within and between data sets. So this is what I think I need to do:
In NCBI; GeoDatasets, I searched age[subset_variable_type], extracting 209 ageing-related files. I download the summary file. In the summary file, I can see the GSEXXXX number for each GDSXXXX number.
Using a python script and FTP, I can pull down the CEL files for each GDS file (e.g. one FTP address would be here: /geo/series/GSE52nnn/GSE52550/suppl, I cd to suppl, and get GSE52550_RAW.tar). When I untar the RAW file, there is a CEL file for each sample.
I tried to look at the CEL file in excel, but I think it is not human-readable.
Next I want to process the data to normalise the data within each data set, between samples. I have read that I can do this by reading all of the CEL files into the rma() package in R bioconductor Affy package.
Then I will do a hierarchical clustering and PCA to identify potential outliers in each data set using hclust() and prcomp() in R. Judging from the brief reading I did of the software, I think for this, I read in all of the GSMXXXX files for one particular GSEXXXX data set.
The GDS_full.soft files have all of the probes (ID_ref) and the Entrez Gene Name (GeneID). Make an excel sheet with three columns: probe name, gene name and the inter-quartile range of expression values among all of the probe IDs per data set. For each gene, you select the probe that has the largest IQD across all of the samples per data set.
Using this, I can make an excel spreadsheet, on the Y axis is each gene, on the X axis is each of the samples, and each cell is a normalised expression value.
...so now at this point, I have the data normalised within each data set. So I have 209 gene/sample matrices, but I have not tried to combine the data between different data sets.
- So now, if I do the above steps separately for each data set, are the data sets comparable with each other? i.e. can I compare the expression value of gene 1 in data set 1 with the expression value of gene 1 in data set 2?
In that case, is the next step to make a matrix; on the y axis I have a list of all the genes, and on the x axis I have a list of all of the samples from all of the data sets combined? Then each cell is either an expression value, or if a particular gene wasn't in a particular data set, I can just assign "-" to that cell for all of the affected samples. What quality filtering to I do next to check the integrity of this data set?
Basically, I would appreciate if someone could (1) confirm that I am processing each of the data sets correctly, (2) Confirm/Tell me how to combine all of the different data sets together and (3) since I want to remove some samples from each data set (i.e. I only want the healthy individuals from each data set), should I do all of the quality filtering on the full data sets and then remove the samples at the end, or remove the samples at the start before I quality filter the rest of the data set?
Many thanks