Search group specific genes in presence/absence matrix of pangenome
0
0
Entering edit mode
2.8 years ago

I have a large presence/absence matrix of genes from different strains of S. aureus that I obtained through microbiome shotgun sequencing. I conducted a PCA and observed differences between the strains living in the conditions (healthy and diseased). Now, I would like to find out which genes drive this variation. My presence/absence matrix comes with genes in columns and observations/samples in rows. I have a grouping variable as class factor, which specifies which samples belong to which condition (i. e. healthy or diseased).

> head(gene.matrix)

                gene1     gene2      gene3
sample1           1        0             0
sample2           1        1             0
sample3           0        1             0

in total I have >3500 genes and 12 strains.

> grouping.variable

[1] healthy healthy diseased

How can I find out, which genes are (preferentially) present in the strains on diseased sites? Through a loop, I already tried a glm Call: glm(formula = Gene_x ~ condition, family = binomial(link = "logit"), data = gene.matrix) and received a list with an entry for every single gene that holds the results of the glm call. Is this appropriate? Can I just filter out those genes with a pvalue <0.05? And If yes, how can I do that? Would I need to adjust the pvalue for multiple testing?

R statistics binary microbiome pangenome • 508 views
ADD COMMENT

Login before adding your answer.

Traffic: 2655 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6