Question

Difference Between RMA Analysis Of CEL Files And Data From GEOquery Of Array Data

11.6 years ago

J.F.Jiang ▴ 930

Hi all,

Just a point for discussion.

For microarray data, there are two common ways to obtain per-probe expression values across samples:

1) download the original CEL files, then use ReadAffy and rma to get the matrix, or use justRMA directly

2) use GEOquery to obtain the matrix directly

However, I found minimal differences between the values from these two methods, and I do not know why.

Another question: can I use the matrix from GEOquery directly for differential expression analysis, in the same way as the output of rma?

And which is better for DE analysis: 1) probe level or 2) gene level? One gene may be targeted by several probes, and one step of a DE analysis is multiple-testing correction with p.adjust. The array may have 50K probes but only 20K genes, so the two choices may give quite different results.

Can anyone answer these questions?

Thanks

array • 5.7k views

updated 11.6 years ago by Neilfws 49k • written 11.6 years ago by J.F.Jiang ▴ 930

I always use the expression matrix directly. The difference between the two methods can be ignored; array data are not that accurate. I don't know how to choose between probes and genes.

11.6 years ago by jlshi.nudt ▴ 240

I mostly agree with you. I do think that, for gene expression analysis, arrays seem more accurate than RNA-Seq using VST or RPKM values. The great advantage of RNA-Seq, I think, is its ability to cover all genes, especially those with low transcription.

If I am misunderstanding, please correct me.

11.6 years ago by J.F.Jiang ▴ 930

score 6 · Answer 1 · 2013-04-25

Where raw data (CEL files) are available, you should use them, simply because you can never fully trust data that have been processed by someone else unless what they did is absolutely explicit.
You can expect "minimal" differences in RMA values between different implementations. If raw data are not available on which to perform normalization yourself and you are comfortable with the available processed data matrix, by all means use it. By "comfortable" I mean that you understand what kind of values it contains, how they were derived and that they "look sensible" (for example, are not in the hundreds or thousands if a log2 transformation was supposedly applied).
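As a rough illustration of the "look sensible" check, here is a minimal Python sketch. The function name `looks_log2_scale` and the cutoff of 20 are my own assumptions, not part of any package: log2 intensities from Affymetrix arrays usually stay below about 16, so values in the hundreds or thousands suggest the matrix is still on the raw scale.

```python
import math


def looks_log2_scale(values, upper=20.0):
    """Heuristic check that expression values are plausibly log2-scale.

    `upper` is an assumed cutoff, not a standard: log2 intensities are
    usually below ~16, so anything above 20 suggests untransformed data.
    """
    finite = [v for v in values if not math.isnan(v)]  # ignore missing values
    if not finite:
        raise ValueError("no non-missing values to check")
    return max(finite) <= upper


# Hypothetical values taken from one column of a processed matrix
print(looks_log2_scale([2.1, 7.5, 13.9]))       # plausible log2 values
print(looks_log2_scale([120.0, 3500.0, 48.2]))  # looks like raw intensities
```

This only catches the obvious case; it says nothing about which normalization or background correction was applied, so the series record still needs to be read.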
Neither probeset-level nor gene-level data are "better" for DE analysis: it all depends on what you are trying to achieve. Using multiple probesets per gene can be informative if you are interested in splice variants or in evaluating how good probesets are as measures of expression; some may be more "responsive" than others.

In general, the most differentially-expressed genes in a gene-level analysis will also have the most differentially-expressed probesets, simply because gene-level values are a rather crude summary, most often obtained by taking the median of (core) probesets for a gene.
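To make the median summary and the p.adjust concern from the question concrete, here is a minimal Python sketch. The names `collapse_to_genes` and `bh_adjust`, the probeset IDs and the gene labels are all hypothetical, and Benjamini-Hochberg is assumed as the correction method (one of the options p.adjust offers); this is not how any particular package does it.

```python
from statistics import median


def collapse_to_genes(probeset_values, probeset_to_gene):
    """Summarize one sample to gene level by the median of its probesets."""
    by_gene = {}
    for probeset, value in probeset_values.items():
        gene = probeset_to_gene.get(probeset)
        if gene is None:  # probeset without a gene annotation: skip it
            continue
        by_gene.setdefault(gene, []).append(value)
    return {gene: median(vals) for gene, vals in by_gene.items()}


def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in input order."""
    n = len(pvalues)
    # Walk from the largest p-value down, keeping a running minimum so the
    # adjusted values stay monotone in the raw p-values.
    order = sorted(range(n), key=lambda i: pvalues[i], reverse=True)
    adjusted = [0.0] * n
    running_min = 1.0
    for steps_from_top, i in enumerate(order):
        rank = n - steps_from_top  # 1-based rank in ascending order
        running_min = min(running_min, pvalues[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical log2 values for one sample and a hypothetical annotation
values = {"a_at": 8.0, "b_at": 10.0, "c_at": 5.0, "d_at": 9.0}
annotation = {"a_at": "G1", "b_at": "G1", "c_at": "G2"}
print(collapse_to_genes(values, annotation))

print(bh_adjust([0.01, 0.04, 0.03, 0.5]))
```

Because each adjusted value scales with the number of tests, the same raw p-value is penalized more among 50K probesets than among 20K genes, so the two lists of significant hits need not match exactly even when the top of the ranking agrees.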