You can definitely use this kind of data to test the hypothesis that a positive correlation exists, but you need to perform a statistical analysis that uses all the data points and not the means directly. Here are some options I can think of:
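To make the "all the data points, not the means" distinction concrete, here is a minimal sketch on simulated data. The data frame `df` and its columns `motif_count` and `fold_change` follow the names used in the snippets in this answer, but the values are invented purely for illustration and the effect size is an assumption.

```r
# Hypothetical data: one row per sequence (simulated, not real measurements)
set.seed(1)
motif_count <- rpois(200, lambda = 4)
fold_change <- 0.1 * motif_count + rnorm(200)
df <- data.frame(motif_count, fold_change)

# Per-count means throw away the group sizes and the within-group spread
group_means <- aggregate(fold_change ~ motif_count, data = df, FUN = mean)
group_sizes <- table(df$motif_count)

# Correlating the means gives a group of 1 the same weight as a group of 40
cor_means <- cor(group_means$motif_count, group_means$fold_change)

# Correlating the raw points keeps every observation in play
cor_all <- cor(df$motif_count, df$fold_change)

print(group_sizes)
print(c(means_only = cor_means, all_points = cor_all))
```

The correlation computed on the means will usually look much stronger than the one on the raw points, which is exactly why a test should be run on the latter.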
Linear model
On the face of it, one might think that a linear regression should suffice:
# R
myLinearModel <- lm(fold_change ~ motif_count, data = df)
summary(myLinearModel)
# Coefficients ... etc.
# F-stat: , p-value: 0.0001
However, since you know you have some groups with a small number of samples (e.g., only one with 24 occurrences of the motif), the burden is going to fall on you to prove that none of the points have a disproportionate influence on the regression coefficient(s). So, you would additionally have to do some leverage or influence analysis on your resulting model. For example, checking that you don't have any high Cook's distances:
cooks.distance(myLinearModel)
# wait...how do I interpret these again? something about 3...
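As a rough sketch of how the Cook's distances might be used in practice: there is no single agreed cutoff, and commonly quoted rules of thumb include 4/n, 1, or three times the mean Cook's distance. The example below uses 4/n on simulated data with one deliberately extreme motif count; the data and the choice of threshold are assumptions for illustration.

```r
# Hypothetical data with a single sequence carrying 24 motifs
set.seed(2)
df <- data.frame(motif_count = c(rpois(60, lambda = 3), 24))
df$fold_change <- 0.1 * df$motif_count + rnorm(nrow(df))

fit <- lm(fold_change ~ motif_count, data = df)
cd <- cooks.distance(fit)

# Rule-of-thumb cutoff (an assumption, not a formal test)
threshold <- 4 / nrow(df)
influential <- cd > threshold
print(df[influential, ])

# Refit without the flagged points and compare the coefficients
fit_trimmed <- lm(fold_change ~ motif_count, data = df[!influential, ])
print(rbind(all_points = coef(fit), trimmed = coef(fit_trimmed)))
```

If the slope changes materially between the two fits, the conclusion is resting on a handful of points and should be reported with that caveat.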
Binning
Even if that works out, there are two potentially problematic issues:
- the assumption of continuity, when we really have discrete counts - maybe not so big a deal though; and
- the assumption of linearity, when we probably doubt that adding 5 motifs to a sequence already containing 20 will have the same effect as adding 5 to one with only 2.
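One way to probe the linearity concern, sketched here as an assumption rather than a required step, is a lack-of-fit comparison: fit the straight-line model and a model with a separate mean for every motif count, then compare them with an F-test. The data below are simulated with a saturating effect so that the straight line is wrong by construction.

```r
# Hypothetical data: the effect of extra motifs levels off at high counts
set.seed(3)
motif_count <- sample(0:10, 300, replace = TRUE)
fold_change <- log1p(motif_count) + rnorm(300, sd = 0.5)
df <- data.frame(motif_count, fold_change)

# Straight-line model versus one free mean per observed count
fit_line <- lm(fold_change ~ motif_count, data = df)
fit_free <- lm(fold_change ~ factor(motif_count), data = df)

# The line is nested in the free-means model, so anova() gives an F-test;
# a small p-value suggests the straight line does not fit the group means
lack_of_fit <- anova(fit_line, fit_free)
print(lack_of_fit)
```

This only works when counts repeat across sequences, and groups with a single observation contribute nothing to the comparison.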
It might be more useful to bin the motif counts into levels, like "low", "medium", and "high", and do something similar with the fold change ("down", "neutral", "up"). You could then use chisq.test to test for independence (the null hypothesis).
If you had some idea for how to split this prior to looking at the data, that would be ideal - but you've already peeked at the data, which means you need to be careful about making biased choices in your analysis. Hand-picking bins at this point could be construed as cherry-picking your statistical test.
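A minimal sketch of the binning approach, on simulated data. The cut points (3 and 7 motifs, and a fold change of plus or minus 0.5 on a scale centered at zero) are hypothetical placeholders; in a real analysis they would ideally be fixed before looking at the data.

```r
# Hypothetical data: fold change assumed to be on a scale centered at zero
set.seed(4)
df <- data.frame(motif_count = rpois(300, lambda = 5))
df$fold_change <- 0.15 * (df$motif_count - 5) + rnorm(300)

# Placeholder cut points, chosen here only for illustration
count_bin <- cut(df$motif_count, breaks = c(-Inf, 3, 7, Inf),
                 labels = c("low", "medium", "high"))
fc_bin <- cut(df$fold_change, breaks = c(-Inf, -0.5, 0.5, Inf),
              labels = c("down", "neutral", "up"))

# 3 x 3 contingency table of the two binned variables
tab <- table(count_bin, fc_bin)
print(tab)

# Monte Carlo p-value, which avoids the warning about small expected counts
res <- chisq.test(tab, simulate.p.value = TRUE, B = 10000)
print(res)
```

Note that the chi-squared test ignores the ordering of the bins, so it tests for any association rather than specifically for a positive trend.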
Ordering
Another option is to do an ordinal ANOVA, such as the one provided by the ordAOV function in the ordPens package. In this approach, you make no assumption about the scale of the effect-size difference between your motif count groups, and you also account for the variance within each group. To do this, you would use the rank of the motif count in place of the motif count itself. Here's what the test would look like in R:
library(ordPens)
rankedMC <- factor(df$motif_count, ordered = TRUE)
levels(rankedMC) <- seq_along(levels(rankedMC))
rankedMC <- as.numeric(rankedMC)
ordAOV(rankedMC, df$fold_change)
# Test stat = ..., p-value = 0.0005
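Another nice thing here is that the p-value comes from simulating the null distribution of the test statistic, so you are not leaning on a large-sample approximation. The underlying model is still a linear mixed model, though, so the usual assumptions about the residuals are worth a quick check.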
![enter image description here][1]

Thank you, mmfansler.
A linear model was the first thing I thought of for finding the relationship between count and FC, but I gave up after plotting all the points on the same figure: it is not linear at all. Your suggestion about cooks.distance() is great; at least it gives me some ideas about these outliers.
Binning and ordering also help me understand my data. Thanks a lot.