I'm predicting that a particular set of genes (n=20) is more GC-rich than expected given all protein-coding genes. How can I test this hypothesis efficiently?
I generated the same number of random genes (also n=20) and calculated their GC content. A two-tailed, two-sample t-test with unequal variances gives a significant result (P < 0.005), but I'm not sure I can interpret it with high confidence.
The GC content of a gene is a fraction, so it lies in the interval (0, 1). You can likely model it with a beta distribution, and thus test your hypothesis with a beta regression.
I worry a little about the imbalance between the groups (all genes versus just 20), so a Monte Carlo simulation might be safer. The null hypothesis for your experiment is that the GC content of your 20 genes is no different from that of 20 genes selected at random from the genome. To test this, run 10,000+ iterations; in each one, randomly sample 20 genes from the genome and check whether their mean GC content is greater than the mean GC content of your 20 genes. The fraction of iterations in which the random sample comes out ahead is your empirical one-sided p-value. For significance at the 5% level, the mean of your 20 genes should be greater than the mean of the randomly sampled genes more than 95% of the time.
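The simulation described above can be sketched in a few lines of Python. This is only a minimal sketch under stated assumptions: `genome_gc` and `observed_gc` are hypothetical lists of per-gene GC fractions (filled here with toy beta-distributed values, not real data), and the function name `empirical_p_value` is made up for illustration. The add-one correction in the p-value is a common convention, not something required by the approach.

```python
import random
from statistics import mean


def empirical_p_value(observed_gc, genome_gc, n_iter=10_000, seed=0):
    """One-sided Monte Carlo p-value: how often does a random gene set of
    the same size have a mean GC content at least as high as the observed set?"""
    rng = random.Random(seed)
    observed_mean = mean(observed_gc)
    k = len(observed_gc)
    # Mean GC content of each randomly sampled gene set (sampling without replacement)
    null_means = [mean(rng.sample(genome_gc, k)) for _ in range(n_iter)]
    n_extreme = sum(m >= observed_mean for m in null_means)
    # Add-one correction so the estimate is never exactly zero
    p = (n_extreme + 1) / (n_iter + 1)
    return p, null_means


# Toy data only (assumption): replace with real per-gene GC fractions.
toy_rng = random.Random(1)
genome_gc = [toy_rng.betavariate(20, 20) for _ in range(5000)]  # centred near 0.5
observed_gc = [toy_rng.betavariate(30, 20) for _ in range(20)]  # centred near 0.6

p, null_means = empirical_p_value(observed_gc, genome_gc, n_iter=2000)
print(f"empirical one-sided p = {p:.4f}")
```

Keeping `null_means` around is useful because the same values can be plotted afterwards.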
I generated 1000 control groups, and in only 1 of them was the average of my observed data not greater than the control average. So the observed mean was higher than the control mean in 99.9% of the samples, which I take as strong evidence that the GC content of my genes differs from that of randomly selected genes.
Can you suggest a way to illustrate this result as well? Last time, when I was using 1 control group, I made a boxplot, but I can't do that now that I have 1000 control groups. Maybe you can think of some way to represent this result graphically? Thank you again, in advance.
I would just make a histogram of the mean values you got from all of your simulated sets, and mark the mean value of your experimental set on the plot.
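As a rough illustration of that plot, here is a dependency-free sketch that bins the simulated means and marks the bin containing the observed mean. The names `text_histogram`, `null_means`, and `observed_mean` are hypothetical, and the values are toy data rather than real GC measurements.

```python
import random
from statistics import mean


def text_histogram(null_means, observed_mean, n_bins=20, width=50):
    """Return text lines of a histogram of simulated means, with the bin
    holding the observed mean flagged."""
    lo = min(min(null_means), observed_mean)
    hi = max(max(null_means), observed_mean)
    step = (hi - lo) / n_bins or 1.0  # guard against a zero-width range
    counts = [0] * n_bins
    for m in null_means:
        idx = min(int((m - lo) / step), n_bins - 1)
        counts[idx] += 1
    obs_idx = min(int((observed_mean - lo) / step), n_bins - 1)
    peak = max(counts)
    lines = []
    for i, c in enumerate(counts):
        bar = "#" * round(width * c / peak)
        marker = "  <-- observed mean" if i == obs_idx else ""
        lines.append(f"{lo + i * step:.3f} | {bar}{marker}")
    return lines


# Toy data only (assumption): 1000 simulated control means and a made-up observed mean.
rng = random.Random(0)
genome_gc = [rng.betavariate(20, 20) for _ in range(5000)]
null_means = [mean(rng.sample(genome_gc, 20)) for _ in range(1000)]
observed_mean = 0.60

lines = text_histogram(null_means, observed_mean)
print("\n".join(lines))
```

If matplotlib is available, the graphical equivalent is `plt.hist(null_means)` followed by `plt.axvline(observed_mean)` to draw a vertical line at the experimental value.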
You have no idea how much you helped me. I'll do it that way; before this comment I was looking at the problem in a far too complicated way.