Can a one-sample t-test be used to determine statistical significance for log2FC values?
14 months ago
aUser ▴ 70

I downloaded proteomic data from CPTAC, but it is already in log2FC form [and the spectral count data lack the sample mapping]. I want to determine which genes are statistically significant. What test can be applied?

Logically, if there is no difference, the ratio of protein expression between tumor and normal should be 1, so the log2FC should be 0 [log2(Tumor/Normal) = log2(1) = 0]. Can I use a one-sample t-test (with mu = 0) or a Wilcoxon signed-rank test (gene expression is not Gaussian)? Or is there another method?

For instance, I created a random distribution spanning the min/max of the gene's expression and then compared the actual gene expression against this random distribution. Is that logical?

The p-values from the one-sample t-test, the one-sample Wilcoxon test, and the Wilcoxon test against the random distribution are 0.89, 0.90, and 0.202, respectively.

Any hint in this regard is highly appreciated.

log2FC T.test • 1.4k views
14 months ago

In theory, a t-test is only valid where the statistic you are testing is expected to be normally distributed. It's not clear to what extent this is true for log2 fold changes (LFCs), particularly if we don't know how they were produced. I suspect that the LFCs for highly expressed proteins are probably fairly normal, but that there is a departure from normality when counts are low.

However, in practice, the t-test is fairly robust to departures from normality, and is usually more powerful than the non-parametric alternative, even when the normality assumptions are violated. We don't usually use unmoderated t-statistics on gene expression data, because estimating the variance can be difficult with the low sample numbers usually available, but with 200 samples you should be fine. Although with 200 samples, you are probably also fine, power-wise, with the Wilcoxon test.
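To make the two options concrete, here is a minimal sketch of both tests for a single protein's LFC values against mu = 0, using only the Python standard library. The function names and the example values are made up for illustration. Both p-values use a large-sample normal approximation, which is an assumption that seems reasonable at ~100-200 samples; for exact small-sample p-values you would normally reach for a library routine such as `scipy.stats.ttest_1samp` or `scipy.stats.wilcoxon` instead.

```python
import math
from statistics import NormalDist, mean, stdev


def one_sample_t(values, mu=0.0):
    """Return (t statistic, two-sided p-value).

    Assumption: the p-value uses a normal approximation to the t
    distribution, and the values are not all identical (stdev > 0).
    """
    n = len(values)
    t = (mean(values) - mu) / (stdev(values) / math.sqrt(n))
    p = 2.0 * (1.0 - NormalDist().cdf(abs(t)))
    return t, p


def wilcoxon_signed_rank(values, mu=0.0):
    """Return (W+, two-sided p-value) for the one-sample signed-rank test.

    Assumption: normal approximation without a tie correction to the variance.
    """
    diffs = [v - mu for v in values if v != mu]  # drop zero differences
    n = len(diffs)
    if n == 0:
        return 0.0, 1.0
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Find the run of tied absolute differences and give them the average rank.
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean_w = n * (n + 1) / 4.0
    sd_w = math.sqrt(n * (n + 1) * (2 * n + 1) / 24.0)
    z = (w_plus - mean_w) / sd_w
    p = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return w_plus, p


# Hypothetical LFC values for one protein across 20 patients.
lfc = [0.5 + 0.1 * i for i in range(20)]
print(one_sample_t(lfc))
print(wilcoxon_signed_rank(lfc))
```

With a proteins × samples table, the same function would simply be applied to each protein's row in turn.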

One thing I would watch out for, however, is your assumption that the expected LFC under the null hypothesis is 0. This will only be true if the data are well normalised, and it's not clear from your description what normalisation may or may not have been applied to the data before the calculation of LFCs. On the assumption that most genes are not DE, you could look at the distribution of mean LFCs across the whole dataset and check that it is symmetrical and centered around 0.
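A rough sketch of that check, assuming the data have been loaded into a dict mapping each protein to its list of per-patient LFCs (the `lfc_by_protein` layout, the function name, and the toy values are all hypothetical). It reports the mean and median of the per-protein mean LFCs, plus a simple moment-based skewness, where a value near 0 suggests a symmetric distribution. A histogram of the per-protein means would be the natural visual companion.

```python
from statistics import mean, median


def centering_summary(lfc_by_protein):
    """Summarise the per-protein mean LFCs: mean, median, and skewness."""
    protein_means = [mean(values) for values in lfc_by_protein.values()]
    n = len(protein_means)
    centre = mean(protein_means)
    # Second and third central moments (population form).
    m2 = sum((x - centre) ** 2 for x in protein_means) / n
    m3 = sum((x - centre) ** 3 for x in protein_means) / n
    skewness = m3 / m2 ** 1.5 if m2 > 0 else 0.0
    return {
        "mean": centre,
        "median": median(protein_means),
        "skewness": skewness,
    }


# Toy example with three proteins and two patients each (made-up values).
toy = {
    "P1": [-1.0, 1.0],
    "P2": [0.5, 1.5],
    "P3": [-1.5, -0.5],
}
print(centering_summary(toy))
```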


Thank you for the detailed response. I will try again to find how the data have been processed, but so far no luck.

Regarding the normality assumption of the t-test: the expression of a few genes follows a normal distribution, while that of the others does not, when tested with the Shapiro-Wilk normality test.

"you could look at the distribution of mean LFCs across the whole dataset and check that it is symmetrical and centered around 0." The mean of the mean values is -0.0005 and the mean of the median values is -0.008. I can assume that it is close to 0, but the distribution is not normal; rather, it is negatively skewed (Shapiro-Wilk test p << 0.01).

I think I should use the one-sample Wilcoxon signed-rank test.


I like i.sudbery's response. With your current data, I would definitely 100% use the Wilcoxon test. You have more than enough samples and you wouldn't be violating any distributional assumptions. You'll get plenty of proteins that are statistically significant and are unlikely to be false positives. Good sensitivity + few false discoveries = meaningful biological results :)
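On the false-discovery point: with ~2500 proteins tested, one common way to keep false discoveries in check is to adjust the per-protein p-values, for example with the Benjamini-Hochberg procedure. This goes beyond what was suggested above, so treat it as an optional sketch; the function name and the example p-values are made up.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical raw p-values for four proteins.
raw = [0.01, 0.04, 0.03, 0.005]
print(benjamini_hochberg(raw))
```

Proteins whose adjusted p-value falls below the chosen FDR threshold (0.05 is a common choice) would then be reported as significant.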

14 months ago
dsull ★ 6.9k

Not sure I understand. If it's log2FC, it's a single number, so no statistical significance testing can be done (at least not at an individual protein level).

You can't just randomly make up numbers and perform a statistical test; to perform a statistical test, you need some estimate of the variance of your protein expression (I don't see any way to obtain such an estimate unless you have the actual numbers).


For clarification, I have a population of log2FC values for each protein, from different patient samples (~100-200). The entire dataset is ~2500 proteins × 200 samples.


The same question remains: logFC relative to what? Patient 1 vs. what? Patient 2 vs. what?...


Sorry for the very vague post. The log2FC values are tumor vs. normal, i.e. log2(Tumor/Normal).
