My experiment is the following:
- Temperature 26C vs. Temperature 30C
- Treatment Saline vs. Treatment BMC
- Timepoints 0, 2, and 3.
My intention is to create GLM model to look at differential abundance between groups:
- 26C vs. 30C for Saline treatment
- 26C vs. 30C for BMC treatment
- Saline vs. BMC at 26C
- Saline vs. BMC at 30C
In doing so, I would like to include all the samples possible when calculating the dispersion which is why I'm using a GLM instead of a Fisher's Exact Test for subsets of samples. I would also like to incorporate the ordered time information.
I'm trying to understand why my GLM model is failing. I tried using all of my data including the baseline T0 which was an issue with my design matrix because it was not of full rank. I understand why the baseline was an issue as it didn't add any information for any of the other categories. However, I'm not sure why this occurs for Temperature_30C
.
Here's my data, code, and experiment design. I'm using Python as my main environment with an R package called edgeR
in the backend using an interface called rpy2
if this helps having this info.
from rpy2 import robjects as ro
from rpy2 import rinterface as ri
from rpy2.robjects.packages import importr
from rpy2.robjects import pandas2ri
R = ro.r
NULL = ri.NULL
random_state=0
ro.r('set.seed')(random_state)
# Package
edgeR = importr("edgeR")
limma = importr("limma")
def pandas_to_rpy2(df, rpy2_version_major=None):
if rpy2_version_major is None:
from rpy2 import __version__ as rpy2_version
rpy2_version_major = int(rpy2_version.split(".")[0])
if rpy2_version_major == 2: # v2.x
return pandas2ri.py2ri(df)
if rpy2_version_major == 3: # v3.x
from rpy2.robjects.conversion import localconverter
with localconverter(ro.default_converter + pandas2ri.converter):
return ro.conversion.py2rpy(df)
def build_glm_model(X, design_matrix, random_state=0):
components = X.columns
# Run edgeR
r_X = pandas_to_rpy2(X.T)
r_components = ro.vectors.StrVector(components)
r_design = pandas_to_rpy2(design_matrix)
# Give DGEList the gene expression table with groups
model = edgeR.DGEList(
counts= r_X,
genes = r_components,
# group = r_y,
)
# Normalize the expression values
# if kws["calculate_normalization_factors"]:
model = edgeR.calcNormFactors(model)
# Calculate dispersion
model = edgeR.estimateGLMCommonDisp(model, r_design)
# if kws["estimate_glm_trended_dispersion"]:
model = edgeR.estimateGLMTrendedDisp(model, r_design)
# Estimate tagwise dispersion
# if kws["estimate_glm_tagwise_dispersion"]:
model = edgeR.estimateGLMTagwiseDisp(model, r_design)
# Fit
fit = edgeR.glmFit(model, r_design)
return (model, fit)
# Counts
X = pd.read_csv("https://pastebin.com/raw/J7kmL8Ly", sep="\t", index_col=0)
# Metadata
df_metadata = pd.read_csv("https://pastebin.com/raw/PANaC3r5", sep="\t", index_col=0)
design_matrix = pd.concat([
pd.get_dummies(df_metadata["treatment"].fillna("Baseline"), prefix="Treatment"),
pd.get_dummies(df_metadata["temperature"], prefix="Temperature"),
df_metadata["collection_time_numeric"].to_frame("t"),
], axis=1).astype(int)
# Remove the baseline samples
design_matrix = design_matrix.drop("Treatment_Baseline", axis=1).query("t > 0")
# Create the GLM model
model, fit = build_glm_model(X.loc[design_matrix.index], design_matrix)
# RRuntimeError: Error in glmFit.default(y, design = design, dispersion = dispersion, offset = offset, :
# Design matrix not of full rank. The following coefficients not estimable:
# Temperature_30C
Also posted on Cross Validated https://stats.stackexchange.com/questions/521917/
Answered by Gordon here: How to structure my edgeR model to incorporate a temporal variable and 2 categorical variables?