I have a large presence/absence matrix of genes from different strains of S. aureus that I obtained through microbiome shotgun sequencing. I conducted a PCA and observed differences between the strains living in the conditions (healthy and diseased). Now, I would like to find out which genes drive this variation. My presence/absence matrix comes with genes in columns and observations/samples in rows. I have a grouping variable as class factor, which specifies which samples belong to which condition (i. e. healthy or diseased).
> head(gene.matrix)
gene1 gene2 gene3
sample1 1 0 0
sample2 1 1 0
sample3 0 1 0
in total I have >3500 genes and 12 strains.
> grouping.variable
[1] healthy healthy diseased
How can I find out, which genes are (preferentially) present in the strains on diseased sites? Through a loop, I already tried a glm Call:
glm(formula = Gene_x ~ condition, family = binomial(link = "logit"), data = gene.matrix)
and received a list with an entry for every single gene that holds the results of the glm call. Is this appropriate? Can I just filter out those genes with a pvalue <0.05? And If yes, how can I do that? Would I need to adjust the pvalue for multiple testing?