Hi Sean, Frymor and Davy,
I am a newbie to microarray data analysis. I've got a data set of microarrays to work on. I was able to run the rma normalization and LIMMA analysis.
Now I have a big table of over 25K genes. If I understood it correctly most of them are not differentially regulated.
The question is of course, how to identify the genes which are significant?
I used the example offered here to plot my data and got the same curve and lines of p-values. Am I right that this plot is equivalent to a histogram of the p-value distribution, which many examples show how to draw and which looks something like this: (histogram image, taken from http://www.tcrt.org)
But, now that I have it, what do I do with it?
I know that the plot shows me the distribution of the p-values over my experiments. The person I am working with told me I need to look for the point where the curve flattens out, but this seems as arbitrary as picking a p-value at random.
His reason for that choice was - everybody is doing so.
This is what I understood after reading some papers:
- In a normal experiment most of the genes won't be differentially regulated.
- The adj. p-value is the p-value after correction for multiple hypothesis testing.
- The higher this value is, the more false positives I get in my list of DE genes.
But none of this helps me find the right threshold. I read this paper: Estimating p-values in small microarray experiments.
Here they try to explain why permutations are a good idea with small data sets (which I also have: four replicates for each of three different conditions).
But I still don't have a clue about the 'right' value.
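To make the permutation idea concrete, here is a minimal Python sketch of an exact label-permutation test for a single gene. It is an illustration only, not the method from the paper mentioned above; the function name `permutation_p_value` and the expression values are made up. With four replicates in each of two conditions there are only 70 ways to relabel the arrays, so the smallest two-sided p-value a per-gene permutation can produce is 2/70 (about 0.029).

```python
from itertools import combinations
from statistics import mean


def permutation_p_value(group_a, group_b):
    """Exact two-sided permutation p-value for a difference in group means.

    Returns (p_value, number_of_relabelings). Hypothetical helper, not part
    of limma or any other package.
    """
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    observed = abs(mean(group_a) - mean(group_b))
    total = 0
    at_least_as_extreme = 0
    # Enumerate every way of labelling n_a of the pooled values as "group A".
    for idx in combinations(range(len(pooled)), n_a):
        chosen = set(idx)
        a = [pooled[i] for i in range(len(pooled)) if i in chosen]
        b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        total += 1
        # Small tolerance so floating-point noise does not hide exact ties.
        if abs(mean(a) - mean(b)) >= observed - 1e-12:
            at_least_as_extreme += 1
    return at_least_as_extreme / total, total


# Made-up log2 expression values for one gene, four replicates per condition.
control = [7.1, 7.3, 6.9, 7.2]
treated = [8.4, 8.1, 8.6, 8.3]
p_value, n_perm = permutation_p_value(control, treated)
print(n_perm, round(p_value, 4))
```

Even for this perfectly separated gene the p-value cannot drop below 2/70, so a per-gene permutation p-value from this few arrays will rarely survive correction over thousands of genes.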
> You are not going to get an answer from this group on the question of what the right number is because there is no one-size-fits-all number to use. If you are unclear about how to interpret your results, I suggest you find a local collaborator who can work with you on your data.
I understand Sean's point that it is difficult to answer such a question, as there is no straight or direct answer. Each experiment is a single, unique data set. But I am sure there is some kind of method to define the right p-value by looking at the distribution of the data.
One explanation might be that I should choose the point where the curve flattens because that is where I get the largest number of significantly differentially regulated genes with the smallest number of false positives in this data set.
Does this explanation hold up?
Thanks for the help.
Alex
You cannot use the raw p-value, unfortunately, as it is not corrected for multiple testing. Besides the comment from neilfws, you might also consider using geneset-based testing like the limma romer and roast functions or a package like globaltest. The idea is to capture biological signal from multiple genes simultaneously where each gene, taken individually, is uninformative. This is a related concept to that in @Davy's answer.
Like Sean says, the adjusted p-value is the one to use, because you have many more variables (10000) than samples (3 x ? arrays). It's not uncommon to get no significant DE using limma, particularly with a small number of replicates or a large number of probes. I'd suggest giving siggenes a try too - http://bioconductor.org/packages/release/bioc/html/siggenes.html - if that gives no DE except at high FDR values, you'd have to conclude that the arrays are uninformative.
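For readers unsure what the adjusted p-value is doing, a minimal sketch of the Benjamini-Hochberg adjustment follows, written in Python for illustration. I believe this is the default behind limma's adj.P.Val column, but treat that as an assumption and check the adjustment method in your own topTable call. The function name `bh_adjust` and the p-values are made up.

```python
def bh_adjust(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order.

    Hypothetical helper written for illustration; in R one would normally
    rely on the adjustment the analysis package already provides.
    """
    m = len(p_values)
    # Indices of the p-values, sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, scaling by m / rank and keeping
    # the running minimum so adjusted values never decrease with rank.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Made-up raw p-values for eight genes.
raw = [0.0001, 0.004, 0.019, 0.03, 0.2, 0.5, 0.7, 0.9]
adj = bh_adjust(raw)
n_raw = sum(p < 0.05 for p in raw)
n_adj = sum(p <= 0.05 for p in adj)
print(n_raw, n_adj)
```

In this toy set four genes pass a raw cutoff of 0.05 but only two pass the same cutoff after adjustment. The same effect, scaled up to thousands of probes, is why a table can contain raw p-values below 0.01 and still show adjusted values near 1.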
By 10K I meant the complete table from LIMMA, not the DEGs. That's exactly the problem though. In the experiment I have three replicates for each array. After running LIMMA I have one adj.p.val of 0.717 and the rest = 1. My p-values are also quite high: all in all I have 34 genes with a p-value < 0.01. This is why I ask whether there is any point in going higher with the p-value, e.g. p-val = 0.1 (488 genes).
Tomas
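A quick back-of-the-envelope check on those counts, as a hedged Python sketch. It assumes roughly 10,000 tests (the "10K" in this thread) and that p-values are uniformly distributed when no gene is differentially expressed; both are simplifications, and the function name `expected_null_hits` is made up.

```python
def expected_null_hits(n_tests, threshold):
    """Expected number of p-values below `threshold` if every null is true.

    Assumes uniformly distributed null p-values (a simplification).
    """
    return n_tests * threshold


n_tests = 10_000                    # approximate table size from the thread
observed = {0.01: 34, 0.1: 488}     # counts reported in the thread

summary = {}
for threshold, n_observed in observed.items():
    expected = expected_null_hits(n_tests, threshold)
    summary[threshold] = (n_observed, expected)
    print(f"p < {threshold}: observed {n_observed}, "
          f"expected by chance about {expected:.0f}")
```

Under these assumptions, chance alone would give about 100 genes below 0.01 and about 1000 below 0.1, which is more than the 34 and 488 actually observed. Raising the raw cutoff therefore adds genes no faster than noise would, which is consistent with the adjusted p-values sitting near 1.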