Dear all,
I used this workflow to analyze the Illumina microarray dataset GSE35088. However, I did not obtain any DE genes, whereas GEO2R gave about 1000. I found that GEO2R does not perform a filtering step. When I filtered the probes (control probes, probes with no gene symbol, and probes that failed), only about 800 of the 22523 probes remained. Is this usual, or is something wrong?
Also, I guess the normalization step differs between the two workflows, is that right? If so, would GEO2R not be safe for getting accurate results? Kindly let me know if you have any suggestions or advice.
Thank you.
Thank you, Gordon.
Yes, it does not seem reasonable to me either. However, after normalization, I filtered out the unexpressed (failed) probes, i.e. those with a detection p-value > 0.05. I kept the probes with a detection p-value <= 0.05 in at least 3 arrays, as this is the limma default, if I remember correctly.
Any suggestions would be highly appreciated.
A probe being unexpressed is not the same as "failed", nor does it mean that the probe is of poor quality. I do not know of any circumstances where an Illumina beadchip probe can be said to have "failed". A probe that reports that the corresponding gene is not expressed is doing its job correctly.
Anyway, I suspect that you may have been tricked by the fact that Illumina sometimes reports detection p-values such that p < 0.05 means expressed and sometimes reports p-values such that p > 0.95 means expressed. In other words, the detection p-values are sometimes 1 minus what you expect them to be. I suspect that these arrays are using the latter version whereas you're assuming the first. If you used the limma functions read.ilmn and neqc to process the arrays, then limma automatically checks which way around the p-values are.
For this dataset, when I check how many probes are significantly above background in at least 3 arrays, I get the following:
Note that probe filtering for this dataset should take into account the design of the experiment, which in this case includes technical replicates and about 8 arrays per experimental condition. Checking detection in >=3 arrays is not a universal recommendation or a default in limma.
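The counting step described above can be sketched in a few lines. This is only an illustration in plain Python, not limma code: the function name `count_detected_probes`, its parameters, and the toy p-values are all hypothetical, and `min_arrays` should be chosen from the experimental design rather than left at 3. The sketch assumes the detection p-values are available as one row per probe and one column per array.

```python
def count_detected_probes(detection_p, cutoff=0.05, min_arrays=3,
                          small_p_means_expressed=True):
    """Count probes called 'detected' in at least `min_arrays` arrays.

    detection_p: list of rows, one row per probe, one p-value per array.
    small_p_means_expressed: True if p < cutoff means expressed,
    False if the p-values are reported the other way (p > 1 - cutoff).
    """
    kept = 0
    for row in detection_p:
        if small_p_means_expressed:
            n_detected = sum(p < cutoff for p in row)
        else:
            n_detected = sum(p > 1 - cutoff for p in row)
        if n_detected >= min_arrays:
            kept += 1
    return kept


# Toy data (made up): three probes measured on four arrays.
toy_p = [
    [0.01, 0.02, 0.00, 0.50],  # detected in 3 arrays
    [0.60, 0.70, 0.01, 0.90],  # detected in 1 array
    [0.00, 0.00, 0.00, 0.00],  # detected in 4 arrays
]

right_way = count_detected_probes(toy_p, small_p_means_expressed=True)
wrong_way = count_detected_probes(toy_p, small_p_means_expressed=False)
print(right_way, wrong_way)
```

On this toy data, reading the p-values the wrong way around leaves no probes at all, which is the kind of collapse that a filter keeping only about 800 of 22523 probes would suggest.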
Thank you very much.
So about half of the genes are truly expressed, which is completely reasonable. Could you please let me know how to tell whether the 0.05 or the 0.95 detection p-value cutoff applies?
Just check whether the detection p-values increase or decrease with the intensities for each array. Any quick look at the detection p-values will tell you which of those is true.
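As a rough sketch of that check for a single array, one can compare the detection p-values of the dimmest probes with those of the brightest probes. The code below is plain Python for illustration only and is not part of limma; the function name `small_p_means_expressed`, the quartile comparison, and the toy numbers are assumptions made for this example.

```python
from statistics import median


def small_p_means_expressed(intensities, detection_p):
    """Return True if p-values fall as intensity rises (p < 0.05 means
    expressed), False if they rise (p > 0.95 means expressed)."""
    if len(intensities) != len(detection_p):
        raise ValueError("intensities and detection_p must have equal length")
    # Sort probes by intensity and compare the two extremes.
    pairs = sorted(zip(intensities, detection_p))
    quarter = max(1, len(pairs) // 4)
    p_low = median(p for _, p in pairs[:quarter])    # dimmest probes
    p_high = median(p for _, p in pairs[-quarter:])  # brightest probes
    if p_low == p_high:
        raise ValueError("no trend between intensity and detection p-value")
    return p_high < p_low


# Toy array (made up): bright probes have small p-values.
toy_intensity = [80, 95, 110, 400, 2500, 9000, 15000, 30000]
toy_p = [0.9, 0.6, 0.4, 0.05, 0.0, 0.0, 0.0, 0.0]

as_reported = small_p_means_expressed(toy_intensity, toy_p)
reversed_p = small_p_means_expressed(toy_intensity, [1 - p for p in toy_p])
print(as_reported, reversed_p)
```

A scatter plot of detection p-value against intensity for one array gives the same answer at a glance; the function only automates the comparison.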
Many thanks for your support!
One more question: when I do not filter probes on the detection p-value, the number of DE genes is about twice what I get when I keep only the subset of detected probes. Is this probably due to more accurate variance estimation when all genes/probes are present? And are the DE genes obtained this way reliable? Could you please advise on whether to keep all probes or to filter on the detection p-value?
Thank you