Hi all,
I have a statistics question I've been thinking about for a while, but I haven't found a satisfactory answer so far. I would really like to hear your thoughts, and I apologize in advance if a similar question has already been asked (I couldn't find one).
To summarize the problem: imagine we have two sets of transcripts (transcripts A and transcripts B) and want to understand whether
1) transcripts in group A are more/less likely to contain CG-rich regions in their body than transcripts in group B
2) for transcripts in group A the proportion of sequence (base pairs) that sits in a CG-rich region is higher/lower than in group B
In principle (as I understand it), question 1) could be addressed with Fisher's exact test, by building a 2x2 contingency table that reports how many transcripts in A overlap a CpG island, how many do not, how many transcripts in B overlap a CpG island, and how many do not. To start with, is this correct? For question 2), however, it seems strange to apply the same strategy: the counts would be base pairs rather than transcripts, so the numbers are much larger and p-values tend to be very small even for very small differences...
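To make the 2x2 setup concrete, here is a minimal sketch in plain Python of a two-sided Fisher's exact test. The function name `fisher_exact_two_sided` and all counts are hypothetical and made up for illustration. The second call multiplies every cell by 10 to mimic what happens when the counted unit becomes base pairs instead of transcripts: the odds ratio stays the same, but the p-value shrinks.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Rows are groups (A, B); columns are (overlaps CpG island, does not).
    Returns (sample odds ratio, p-value).
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lo = max(0, col1 - row2)
    hi = min(row1, col1)
    p_obs = prob(a)
    # Sum over all tables that are at most as probable as the observed one
    # (small relative tolerance to absorb floating-point rounding).
    p_value = sum(p for p in (prob(x) for x in range(lo, hi + 1))
                  if p <= p_obs * (1 + 1e-7))
    odds_ratio = (a * d) / (b * c) if b * c else float("inf")
    return odds_ratio, min(1.0, p_value)


# Hypothetical transcript counts: 30 of 100 in A and 20 of 100 in B overlap an island.
or_small, p_small = fisher_exact_two_sided(30, 70, 20, 80)

# Same proportions with 10x larger counts (as with pooled base pairs).
or_large, p_large = fisher_exact_two_sided(300, 700, 200, 800)

print(f"small table: OR={or_small:.3f}, p={p_small:.3g}")
print(f"large table: OR={or_large:.3f}, p={p_large:.3g}")
```

In practice one would probably call `scipy.stats.fisher_exact` or R's `fisher.test` instead of hand-rolling this. Either way, reporting the odds ratio next to the p-value helps separate "statistically detectable" from "large".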
I am honestly quite confused at the moment and I would find it extremely helpful to have some feedback.
Thank you in advance for any help you might be able to provide!
Hmmm, you say it would seem more interesting to ask if the CG content differs between the two groups, and this would answer exactly my second question. But then what you mean is to just quantify the CG content for each transcript in both groups and throw the vectors into a Wilcoxon test? E.g. bluntly count the % of the sequence that is C or G and see whether, on average, transcripts in group A are more/less CG-rich than transcripts in group B?
I should have clarified that, without knowing the biological background to your question, I'd find it more interesting to look simply at GC differences. For that, yeah, just throw the per-transcript GC vectors into a Wilcoxon rank-sum test, which is simple enough.
Perhaps you're looking at differences in CpG island occupancy of coding regions or something like that, in which case my answers to (1) and (2) above would probably be more useful.
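For the simple GC comparison, a rough sketch in plain Python might look like the following. The helper names `gc_fraction` and `rank_sum_test` and the toy sequences are hypothetical. The test is a two-sided Wilcoxon rank-sum (Mann-Whitney U) using the normal approximation with a tie correction and no continuity correction, which is only a reasonable approximation for larger groups than the toy ones shown here.

```python
from math import sqrt
from statistics import NormalDist


def gc_fraction(seq):
    """Fraction of unambiguous bases (A, C, G, T) that are G or C."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    return (seq.count("G") + seq.count("C")) / acgt if acgt else 0.0


def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank-sum test; returns (U statistic for x, p-value)."""
    n1, n2 = len(x), len(y)
    n = n1 + n2
    # Pool both groups, remembering which group each value came from.
    pooled = sorted([(v, 0) for v in x] + [(v, 1) for v in y])
    rank_sum_x = 0.0
    tie_term = 0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        midrank = (i + 1 + j) / 2  # average of ranks i+1 .. j
        t = j - i                  # size of this tie group
        tie_term += t ** 3 - t
        rank_sum_x += midrank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    u = rank_sum_x - n1 * (n1 + 1) / 2
    var_u = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return u, 1.0
    z = (u - n1 * n2 / 2) / sqrt(var_u)
    return u, 2 * (1 - NormalDist().cdf(abs(z)))


# Toy sequences, invented for illustration only.
group_a = ["GCGCGCATGC", "GGCCGCGTAC", "CGCGGGCCAT", "GCGCCGGATC"]
group_b = ["ATATGCATTA", "TTAACGATAT", "ATGCATATTA", "TATAGCTAAT"]

gc_a = [gc_fraction(s) for s in group_a]
gc_b = [gc_fraction(s) for s in group_b]
u_stat, p_val = rank_sum_test(gc_a, gc_b)
print(f"U={u_stat}, p={p_val:.3g}")
```

With real data, `scipy.stats.mannwhitneyu` or R's `wilcox.test` would be the usual choice, since both handle small samples more carefully. Because each transcript contributes one value, the sample size here is the number of transcripts, not the number of base pairs.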