Hi
I have a gene expression dataset of 102 patients and 9 healthy controls. I downloaded it from GEO, applied several preprocessing steps (normalization, batch correction based on date, etc.), and was finally able to generate a table containing:
- the individuals on the rows
- the genes on the columns
- each entry ij containing a real value that indicates the expression of gene_j in individual_i
This first preprocessing phase took a lot of effort. Now I would like to perform a differential gene expression analysis, to see how gene expression differs between the patients and the healthy controls.
I checked some packages online (such as DESeq2), and I noticed they all have specific requirements for input files, which need to contain raw counts. Unfortunately, I don't have raw counts.
I would like to perform a differential gene expression analysis by myself, by applying R biostatistics functions to my preprocessed tables.
How can I do it? Any suggestions?
Thanks!
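One way to do this with base R alone is a per-gene Welch t-test followed by multiple-testing correction. The following is a minimal sketch, not a definitive method: the data are simulated stand-ins for the real table, the names expr, group, and results are made up for illustration, and the logFC column assumes the values are on a log scale (typical after microarray normalization, but an assumption here).

```r
# Simulated stand-in for the preprocessed table:
# individuals on rows, genes on columns (hypothetical data).
set.seed(1)
n_patients <- 102
n_controls <- 9
n_genes <- 50
expr <- matrix(rnorm((n_patients + n_controls) * n_genes, mean = 8, sd = 1),
               nrow = n_patients + n_controls,
               dimnames = list(NULL, paste0("gene", seq_len(n_genes))))
group <- factor(rep(c("patient", "control"), times = c(n_patients, n_controls)),
                levels = c("control", "patient"))

# Shift one gene upward in patients so there is something to detect.
expr[group == "patient", "gene1"] <- expr[group == "patient", "gene1"] + 3

# Welch t-test per gene (column), patients versus controls.
p_values <- apply(expr, 2, function(x) {
  t.test(x[group == "patient"], x[group == "control"])$p.value
})

# Difference in mean expression, patients minus controls
# (a log fold change only if the values are on a log scale).
log_fc <- colMeans(expr[group == "patient", , drop = FALSE]) -
  colMeans(expr[group == "control", , drop = FALSE])

# Collect results and adjust for multiple testing (Benjamini-Hochberg).
results <- data.frame(gene = colnames(expr),
                      logFC = log_fc,
                      p.value = p_values,
                      adj.p.value = p.adjust(p_values, method = "BH"))
results <- results[order(results$p.value), ]
head(results)
```

With only 9 controls, per-gene variance estimates are noisy, so a moderated test that borrows information across genes usually has more power than plain t-tests.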
Adding on this, what you have is array data, which provides intensity values rather than counts: a relative measure of gene expression, not absolute counts as in RNA-seq.
limma seems to be pretty much the standard and following their workflow should get you the intended results. Be sure to read the manual thoroughly and also look at this end-to-end workflow for Affymetrix microarrays.
Thank you guys for your replies. There's a lot of material online and I feel like I am drowning in it. I found this interesting question and answer here on BioStars.org, that I tried to implement for my case. I used lmFit(table) and eBayes(fit), as explained, without design.
I was able to generate a table with the values of the fitted model for the patients, and a table for the healthy controls. This is the head of the topTable of the patients fit:
head(topTable(fit, number=Inf, sort.by="p", p.value=0.05))
Some questions:
1) What is the meaning of the p-values associated with each gene that I obtained this way?
2) Was it a good/useful idea to split the patients and healthy controls into two different tables and perform the analysis separately? Or should I keep them together and insert this information into the design parameter? If the latter, how?
Thanks!
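On both questions, a sketch of how the design argument is typically used may help. Called without design, lmFit fits an intercept-only model, so the p-values test whether each gene's mean expression differs from zero, not whether patients differ from controls. Keeping both groups in one matrix and encoding group membership in the design gives the comparison directly. Everything below is illustrative: the data are simulated, and expr_table, group, and res are hypothetical names. Note that limma expects genes on rows and samples on columns, so a table with individuals on rows has to be transposed first.

```r
library(limma)  # Bioconductor package, assumed to be installed

# Simulated stand-in for the preprocessed table:
# individuals on rows, genes on columns (hypothetical data).
set.seed(1)
group <- factor(rep(c("patient", "control"), times = c(102, 9)),
                levels = c("control", "patient"))
expr_table <- matrix(rnorm(111 * 50, mean = 8, sd = 1),
                     nrow = 111,
                     dimnames = list(paste0("sample", 1:111),
                                     paste0("gene", 1:50)))

# limma wants genes on rows and samples on columns.
expr <- t(expr_table)

# Design with an intercept (control mean) and a patient-vs-control term.
design <- model.matrix(~ group)  # columns: (Intercept), grouppatient

# Fit one linear model per gene, then moderate the variances.
fit <- lmFit(expr, design)
fit <- eBayes(fit)

# Test the patient-minus-control coefficient for every gene.
res <- topTable(fit, coef = "grouppatient", number = Inf, sort.by = "P")
head(res)
```

The grouppatient coefficient is the patient minus control difference, so its logFC and adj.P.Val columns are the ones to look at.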