Hi!
I have received a dataset from an old proteomics experiment that contains the SILAC ratios for 3834 proteins under a given condition. I usually use a statistic such as the p-value from a t-test to establish a cut-off for the subsequent over-representation analysis; however, these ratios are not accompanied by any statistic. I have 3 technical replicates.
I'm wondering whether there is a method to derive a statistic for this collection of ratios, and which library/package/software would do it. References would be appreciated.
Thank you!
Please use the "add comment" button when replying to an answer; this keeps the discussion organized.
I assume that the ratio is between two conditions, something like treatment over control. In this case, what you want to test is whether there is a difference in expression between the two conditions which translates into the ratio being significantly different from 1. In the log-transformed space, you then test the null hypothesis that the value is 0. If the mean of your log-transformed data is not 0, you would need to center the data before doing the test.
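As a minimal sketch of that transformation in base R, assuming a hypothetical vector `ratio` of raw SILAC ratios (the values below are made up for illustration, and mean centering is only one possible choice):

```r
# Hypothetical raw SILAC ratios (treatment / control); made-up values
ratio <- c(1, 2, 4, 8)

# Log2-transform so that "no change" (ratio = 1) maps to 0
log_ratio <- log2(ratio)

# Center the distribution at 0 by subtracting the overall mean
# (subtracting median(log_ratio) is a more outlier-robust alternative)
centered <- log_ratio - mean(log_ratio)

# The mean of the centered values should now be 0
mean(centered)
```

With real data, `ratio` would be the column of ratios for all proteins, and the centering would be applied per replicate if the replicates are stored separately.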
Yes, that's exactly what they are: ratios between two conditions. If the data were successfully centered at 0, I would get the following output from the base R t.test function, no?
t.test(data$mean)
One Sample t-test
data: data$mean
t = 0.25743, df = 3833, p-value = 0.7969
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
-0.01591990 0.02073247
sample estimates:
mean of x
0.002406286
So, once I have checked that they are successfully centered at 0 and follow a Gaussian distribution, how should I continue to obtain a statistic that identifies the values representing a significant change in this distribution? Simply by taking the extreme 2.5% of ratio values?
If you want to formally test whether your data are normally distributed, use a Shapiro-Wilk normality test (shapiro.test()), not a t-test. To select proteins with a significant change, test each protein using its replicates (i.e. test whether the mean of the replicates is equal to 0). With only three replicates and correction for multiple testing, though, this approach may not have enough power.
However, I believe statistics are not the answer to your problem here. I would select proteins whose median over replicates is above a given threshold. Using the median enforces reproducibility, i.e. at least half the replicates will be above the threshold. Use prior knowledge to find a biologically relevant threshold. For example, if key players in the process you're interested in are known to change by a certain amount, you could use this to select the threshold. Or, if the key players are known but the size of their change is not, you could rank the proteins by fold change and look at how many of these known players you recover at different thresholds.
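Both options can be sketched in base R. This is only an illustration on simulated data: the matrix `log_ratio` (one row per protein, one column per replicate), the threshold of 0.5, and the use of the absolute median to catch changes in both directions are all assumptions, not values from your dataset.

```r
set.seed(1)  # reproducible simulated data

# Hypothetical input: centered log2 ratios, one row per protein,
# one column per replicate (simulated here purely for illustration)
n_prot <- 200
log_ratio <- matrix(rnorm(n_prot * 3, mean = 0, sd = 0.3), ncol = 3,
                    dimnames = list(paste0("prot", seq_len(n_prot)),
                                    c("rep1", "rep2", "rep3")))

# Option 1: one-sample t-test per protein (is the mean log2 ratio 0?)
p_value <- apply(log_ratio, 1, function(x) t.test(x, mu = 0)$p.value)

# Benjamini-Hochberg correction for multiple testing
p_adj <- p.adjust(p_value, method = "BH")

# Option 2: keep proteins whose median over replicates passes a threshold
threshold <- 0.5  # hypothetical log2 cut-off; choose it from prior knowledge
med <- apply(log_ratio, 1, median)
selected <- names(med)[abs(med) > threshold]

# Number of proteins passing each criterion
sum(p_adj < 0.05)
length(selected)
```

With three replicates, a median beyond the threshold means at least two of the three replicates are at or beyond it in the same direction.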
Okay, I will use a Shapiro test!
Indeed, after using t.test as you specified, the number of proteins with a p-value lower than 0.05 is 187, dropping to 0 after applying the FDR correction... The prior-knowledge approach sounds well suited to this situation, because I have previous experimental evidence of proteins that change under the condition studied. I will try it.
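For reference, the threshold scan based on known proteins could look roughly like this in base R. The protein names, median values, and candidate thresholds are all hypothetical placeholders:

```r
# Hypothetical per-protein median log2 ratios (made-up values)
med <- c(P1 = 1.8, P2 = -0.2, P3 = 0.9, P4 = 0.1, P5 = -1.2, P6 = 0.4)

# Hypothetical set of proteins already known to change under the condition
known <- c("P1", "P3", "P5")

# For each candidate threshold, count how many proteins are selected
# and how many of the known proteins are recovered
thresholds <- c(0.25, 0.5, 1, 1.5)
recovery <- data.frame(
  threshold  = thresholds,
  n_selected = sapply(thresholds, function(th) sum(abs(med) > th)),
  n_known    = sapply(thresholds, function(th) sum(abs(med[known]) > th))
)
print(recovery)
```

A reasonable cut-off would be the most stringent threshold that still recovers most of the known proteins.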
Thank you again, your wisdom is appreciated!