How to extract expression level from a .CEL file?
Reading the fluorescence intensities in the CEL files, initially, will depend on the microarray manufacturer and version. If you let me know which one you are using, then I can guide further. Once you have read the CEL files into an ExpressionSet object, the subsequent steps are fairly standard for the majority of microarrays.
---------------------------------------
How to collapse probe expression levels into gene expression levels?
project.bgcorrect.norm.avg <- rma(project, background=TRUE, normalize=TRUE, target="core")
project.bgcorrect.norm.avg.Exons <- rma(project, background=TRUE, normalize=TRUE, target="probeset")
Usually this is performed during normalisation, using rma() or gcrma(). These perform background correction, quantile normalization, and then transform by log base 2 (in the case of gcrma(), expression values are also adjusted for probe and target sequence GC bias). More specifically:
- Summarise by gene: rma(..., background=TRUE, normalize=TRUE, target="core")
- Summarise by probe / exon: rma(..., background=TRUE, normalize=TRUE, target="probeset")
-----------------------------------------
What's the best way to compare microarray samples with RNAseq samples?
I understand it's not advised to do so, but what I have is a set of
control samples in microarray, and a set of treatment samples in
RNAseq.
Yes, why do you want to do this? The best that you can do is normalise each dataset as per its respective recommended guidelines, and then bring both onto the same scale (e.g. both as log2 expression values). After that, you could convert these to the Z scale and then perform a simple / manual merge. If a transcript exists in one dataset but not the other, then there's nothing that you can do - it has to be eliminated.
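As a rough illustration of the Z-scale-and-merge idea, here is a minimal base R sketch. It assumes both datasets are already normalised, on the log2 scale, and stored as gene-by-sample matrices with matching gene identifiers as row names. The matrices are simulated toy data, the object names are made up, and scaling per sample over the shared genes is just one possible reading of "convert to the Z scale".

```r
# Toy matrices standing in for normalised log2 data (genes x samples).
# All names and values are illustrative only.
set.seed(1)
array_log2 <- matrix(rnorm(12, mean = 8, sd = 2), nrow = 4,
                     dimnames = list(c("GeneA", "GeneB", "GeneC", "GeneD"),
                                     c("ctrl1", "ctrl2", "ctrl3")))
rnaseq_log2 <- matrix(rnorm(12, mean = 5, sd = 3), nrow = 4,
                      dimnames = list(c("GeneB", "GeneC", "GeneD", "GeneE"),
                                      c("trt1", "trt2", "trt3")))

# Keep only genes measured on both platforms; the rest are dropped
common <- intersect(rownames(array_log2), rownames(rnaseq_log2))

# scale() centres and scales each column, i.e. each sample here
# (assumption: Z-scaling per sample over the shared genes)
array_z  <- scale(array_log2[common, , drop = FALSE])
rnaseq_z <- scale(rnaseq_log2[common, , drop = FALSE])

# Simple merge: same genes in the same order, samples side by side
merged <- cbind(array_z, rnaseq_z)
```

After this, every column of merged has mean 0 and standard deviation 1, so platform remains completely confounded with treatment; the Z-scaling only makes the two sets of values nominally comparable.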
-------------------------------------------
Can I use edgeR for the microarray datasets after some preprocessing
(e.g. normalization)? I have developed a whole pipeline for DEG
analysis based on edgeR (e.g. volcano plot, MDS plot, heatmaps), and
it would be nice if the microarrays can be fed into edgeR.
You should make your functions as reproducible as possible. MA, volcano, box, etc. plots are all applied generally to different types of data; therefore, take the opportunity to adapt your functions for general use so that you can re-use them again and again. As a start, here's some code for a simple volcano plot (converting this to an MA plot is easy): A: Volcano Plot from DEseq2
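To show what "general use" could look like, here is a minimal base R sketch of a volcano plot function that takes any results data frame plus the names of its fold-change and p-value columns. This is not the code from the linked post; the function name, argument names, default cut-offs and the toy table are all made up for illustration.

```r
# Generic volcano plot that does not care which package produced the results.
# Function name, arguments and default thresholds are illustrative choices.
volcano_plot <- function(res, lfc_col, p_col, lfc_cut = 1, p_cut = 0.05) {
  lfc <- res[[lfc_col]]
  p   <- res[[p_col]]
  # Flag rows passing both the p-value and the fold-change threshold
  sig <- !is.na(lfc) & !is.na(p) & p < p_cut & abs(lfc) > lfc_cut
  plot(lfc, -log10(p), pch = 20,
       col = ifelse(sig, "red", "grey50"),
       xlab = "log2 fold change", ylab = "-log10(p-value)")
  abline(v = c(-lfc_cut, lfc_cut), h = -log10(p_cut), lty = 2)
  invisible(sig)
}

# Toy results table (made-up numbers)
res <- data.frame(logFC   = c(-2.5, 0.2, 1.8, 0.9),
                  P.Value = c(0.001, 0.60, 0.01, 0.04))
sig <- volcano_plot(res, lfc_col = "logFC", p_col = "P.Value")
```

The same function can then be pointed at a limma topTable() result (columns logFC and P.Value) or an edgeR topTags() table (columns logFC and PValue) just by changing the column names passed in.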
Here are some other ideas: A: Hierarchical Clustering in single-channel agilent microarray experiment
Kevin
About DGE of microarray data using edgeR: RNA-seq gene counts follow a negative binomial distribution, and microarray probeset values follow another (normal after log transformation?) distribution. And maybe logCPM should be replaced by log(count) for the heatmap? I thought everything else would be the same? If so, then I could modify the distribution parameter, replace logCPM with log(count) and then use the whole pipeline? Reimplementing the whole thing again for microarray seems to be a daunting task. :)
Thanks a lot for the example of volcano plot, too. I have it implemented for edgeR.
For actual differential expression analysis, I would just use limma for your microarray data, which is the standard. As you implied, edgeR expects a certain distribution of count values. Microarray and RNA-seq count distributions are inherently different, even after normalisation and log transformation.
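For reference, a basic two-group limma run could look roughly like the sketch below. It assumes the limma package is installed and uses a simulated log2 expression matrix in place of real normalised microarray data; the sample names, group labels and matrix dimensions are invented for illustration.

```r
library(limma)

# Simulated log2 expression matrix (genes x samples), for illustration only
set.seed(42)
expr <- matrix(rnorm(100 * 6, mean = 8, sd = 1), nrow = 100,
               dimnames = list(paste0("gene", 1:100), paste0("s", 1:6)))
group <- factor(rep(c("control", "treated"), each = 3))

# Design matrix: intercept plus a treated-vs-control coefficient
design <- model.matrix(~ group)

fit <- lmFit(expr, design)   # fit a linear model for each gene
fit <- eBayes(fit)           # empirical Bayes moderated statistics

# Full results table for the treated-vs-control coefficient
tt <- topTable(fit, coef = 2, number = Inf)
```

The resulting table has logFC, P.Value and adj.P.Val columns, which is usually all that volcano and MA plotting functions need.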
Once you have obtained test statistics, you could use the downstream functions of edgeR, but I do not really see the point. If you do, I would just be very careful because, as an example, the heatmap function of edgeR is most likely performing commands that differ from the base heatmap functions.
I have made various postings on heatmaps, which you could follow: