My data represents the count of mutations by gene in two groups cases and ctrl. I would like to understand if there is a significant difference between the number of mutations for the same gene between the two groups (gene-wise comparison). In short, I have to compare means grouping by two variables (gene and Type), but in my case I don’t think anova is appropriate since count data should follow a Poisson distribution.\ Here a small subset of my data, as you can see the samples size (cases and ctrls is different).
gene Samples_ID value Type
AB 1 28 Cases
AB 100 22 Cases
AB 101 36 Cases
AB 102 57 Cases
AB 105 29 Cases
AB 106 25 Cases
AB 108 23 Cases
AB 4928 18 Ctrls
AB 4929 18 Ctrls
AB 4930 24 Ctrls
AB 4931 20 Ctrls
AB 4932 25 Ctrls
AB 4933 15 Ctrls
AB 4934 25 Ctrls
AB 4935 22 Ctrls
AB 4936 30 Ctrls
AB 4937 15 Ctrls
AB 4938 18 Ctrls
AB 4939 21 Ctrls
Where "gene" is the gene name, "Samples_ID" represent the patient, "value" is the number of mutation for the given gene, "type" is the group.\
I have tried a generalized linear model as follows using R because i found many similar questions:
fit <- glm(value ~ gene + Type, data = file, family = poisson())
but I'm not convinced it's the right way and I don't know how to proceed. My desired output would be a table with gene name as row, the p-value, adjusted p-value in the columns, in this way I can understand if for a given gene, the mean number of mutations differs in a significant way between groups (cases and ctrls).
PS
Please, give me a simple explanation, I'm not a statistician and my background about statistic is ~ 0.
Thank you in advance.
Best.
Thank you very much for the helpful answer. That was exactly what I needed.
Regards the biases that could influence these results. This subset of data come from RNA seq, the mutations were filtered; they (or at least most of them) represent RNA editing events (this to specify they are not random mutational events but it's even true that as longer the gene is higher could be the number of editable sites).
So I agree with you about the two main source of bias.
But in this specific case, the coverage could be not represent a problem since the considered mutations came from the same regions covered at least by 50 reads across the samples. Regarding the gene length, it could be not a problem if I would compare the same gene between the two conditions. Vice versa, it could bias my results in the case I would like to compare genes between each other finding for example, the most mutated gene.
In that case, it could be possible to bypass this issue by dividing the number of mutations per transcript over the length of the corresponding transcript (e.g retrieving the transcript length from sequencing data) and then using another the statistical approach, maybe using beta regression model. I don't know if the last approach is right, it's the first idea off the top of my head.
or if you do an enrichment analysis: longer genes are more likely to be found to be significantly mutated because there is more power to detect a significant difference in this case. Because of this, there dividing by the length of the gene won't help. You'll still have more power (or conversely, the estimated standard deviation will be lower for the longer genes). The correct way to deal with this is to account for the gene length bias explicitly in the downstream tests - so if you do an enrichment, use a tool that accounts for gene length, such as GOseq.