I have data that contains the occurrences of genes in different lineages:
Lineage  Gene  Gene_function
1        x     regulatory proteins
1        p     cell wall
1        y     conserved hypotheticals
1        x     respiration
1        z     respiration
2        w     cell wall
2        a     cell wall
2        y     regulatory proteins
3        b     respiration
3        x     conserved hypotheticals
3        a     regulatory proteins
3        b     regulatory proteins
3        z     conserved hypotheticals
3        a     respiration
How do I test whether the number of, say, "cell wall" genes differs significantly between the lineages? I'm thinking of something equivalent to a classic ANOVA followed by a Tukey test to identify which specific lineages differ. N.B. there is a different number of rows for each lineage.
Then repeat this for each of the types of genes.
Is there a simple and quick way to do this in R?
Thanks, so I tried this. There were a couple of problems, but I think I found a solution. I put the data in a table using table(). When I run fisher.test() I get the error "FEXACT error 6. LDKEY=617 is too small for this problem...". Something to do with memory. I subsetted the table, and the maximum matrix it works with is 2x3 (or 3x2). I did find on another forum, however, that chisq.test(<table>, simulate.p.value = T) is 'equivalent' to Fisher's exact test. Then I found that fisher.test() also has this argument. Indeed, they produce similar results. I'm not sure, however, what this parameter means! Thanks
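For reference, the steps described in this comment could look roughly like the sketch below. It only uses the small example from the question, so it will not reproduce the FEXACT error. The object names (dat, tab, tab_cw) are made up for illustration, and B = 10000 is an arbitrary choice for the number of simulated tables. The last part is one possible way, under the same assumptions, to look at a single gene function at a time.

```r
# Example data copied from the question.
dat <- data.frame(
  Lineage = c(1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 3),
  Gene = c("x", "p", "y", "x", "z", "w", "a", "y", "b", "x", "a", "b", "z", "a"),
  Gene_function = c(
    "regulatory proteins", "cell wall", "conserved hypotheticals",
    "respiration", "respiration",
    "cell wall", "cell wall", "regulatory proteins",
    "respiration", "conserved hypotheticals", "regulatory proteins",
    "regulatory proteins", "conserved hypotheticals", "respiration"
  )
)

# Lineage x function contingency table (3 x 4 for this example).
tab <- table(dat$Lineage, dat$Gene_function)

set.seed(1)  # simulated p-values change slightly from run to run without a seed

# Monte Carlo versions of both tests; B is the number of simulated tables.
# TRUE is spelled out because T can be overwritten by a variable named T.
fit_fisher <- fisher.test(tab, simulate.p.value = TRUE, B = 10000)
fit_chisq  <- chisq.test(tab, simulate.p.value = TRUE, B = 10000)

fit_fisher$p.value
fit_chisq$p.value

# One function at a time: "cell wall" vs. everything else, per lineage (3 x 2).
is_cell_wall <- dat$Gene_function == "cell wall"
tab_cw <- table(dat$Lineage, is_cell_wall)
fisher.test(tab_cw)$p.value
```

fisher.test() also has a workspace argument that can be increased for larger tables; whether that is enough for the full data set is something I have not checked.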
According to the documentation, fisher.test only applies the simulate.p.value parameter to tables larger than 2 by 2 ("a logical indicating whether to compute p-values by Monte Carlo simulation, in larger than 2 by 2 tables"). I am not a statistician, but I assume that this parameter speeds up the p-value calculation in such tables by estimating it from randomly simulated tables instead of doing an exact (and more accurate) calculation.
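To make the idea of Monte Carlo simulation more concrete, here is a rough sketch of what such a simulation could look like. It is not the actual code inside fisher.test; mc_fisher_p is a hypothetical helper written only for illustration. It draws random tables with the same row and column totals as the observed one (using r2dtable from base R) and counts how often a simulated table is at most as probable as the observed table.

```r
# Hypothetical helper: a hand-rolled Monte Carlo p-value in the spirit of
# fisher.test(..., simulate.p.value = TRUE). Illustration only.
mc_fisher_p <- function(tab, B = 2000) {
  tab <- as.matrix(tab)
  # Log-probability of a table with fixed margins, up to an additive constant.
  stat <- function(m) -sum(lfactorial(m))
  observed <- stat(tab)
  # r2dtable() draws B random tables with the given row and column totals.
  sims <- r2dtable(B, rowSums(tab), colSums(tab))
  simulated <- vapply(sims, stat, numeric(1))
  # Share of simulated tables that are at most as probable as the observed one.
  # The small tolerance guards against floating-point ties.
  (1 + sum(simulated <= observed + 1e-7)) / (B + 1)
}

# "cell wall" vs. other genes per lineage, counted from the question's data.
tab_cw <- matrix(c(1, 2, 0, 4, 1, 6), nrow = 3,
                 dimnames = list(Lineage = 1:3, cell_wall = c("yes", "no")))

set.seed(42)
mc_fisher_p(tab_cw, B = 10000)   # simulated estimate
fisher.test(tab_cw)$p.value      # exact value, for comparison
```

On a table this small the exact test is cheap, so the two numbers can be compared directly. With more simulated tables (larger B) the estimate should get closer to the exact value, at the cost of more computing time.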