Hi!
I have experience with other types of sequencing data (e.g., bulk RNA seq), but I'm new to 16S rRNA sequencing data analysis. I've searched through a few papers on the topic (e.g., here, here, and here) and read through lots of online forum posts, but I haven't been able to find an answer to a question that I have.
Dataset: I have a longitudinal dataset with 3 timepoints (visit 1, visit 2, visit 3). 16S rRNA sequencing was performed at each timepoint for each subject. I have a variety of continuous and categorical clinical variables that we also measured at each visit.
Goal: I want to associate changes in specific taxa over visits (independent variable) to concurrent changes in other variables (dependent variables). E.g., to relate changes in taxa abundance from visit 1 to visit 2 (independent variable) to the change from visit 1 to visit 2 in resting heart rate (dependent variable). I want to do this for all my taxa (hundreds of individual statistical tests).
Past approach with non-16S variables: On a much smaller scale, with variables other than the 16S data, I have been using linear mixed effects models (implemented in the lme4 R package). I'm particularly interested in changes in variables between visit 1 and visit 2 (change between visit 1 and visit 2 will be indicated here as V1V2). I also care about V1V3 for dependent variables (i.e., the change in variables from visit 1 to visit 3), and I generally use a linear mixed effect model to address both questions.
As an example for this post, let's suppose that I'm interested in V1V2 and V1V3 resting heart rate (dependent variable) and V1V2 sleep quality (independent variable). Using the lme4 package's notation, I could set up a model like this to ask whether V1V2 sleep quality relates to V1V2 and V1V3 resting heart rate:
lmer(heart_rate ~ v1_heart_rate + v1_sleep_quality + visit*V1V2_change_sleep_quality + (1|participant_id), data=outcomes_V23)
where the data frame that I pass in (outcomes_V23
) has been filtered to only have rows for visits 2 and 3, and each row has a column representing the visit 1 heart rate (v1_heart_rate
) and the visit 1 sleep quality (v1_sleep_quality
), along with a column for V1V2_change_sleep_quality
, which was directly calculated in my script.
A simplified version of outcomes_V23
with only 2 participants might look like this:
participant_id visit heart_rate v1_heart_rate v1_sleep_quality V1V2_change_sleep_quality
1 2 100 110 5 2
1 3 90 110 5 2
2 2 70 90 8 -1
2 3 77 90 8 -1
To be clear, each of a participant's rows will have the same value for v1_heart_rate
, v1_sleep_quality
, and V1V2_change_sleep_quality
.
This set up allows me to ask whether a V1V2 change in an independent variable is associated with a V1V2 or V1V3 change in a dependent variable.
Current problem: I know that longitudinal microbiome data presents some unique statistical challenges, and benchmark papers (e.g., here) suggest that statistical approach particularly matters with this kind of data to avoid an inflated FDR. I have been unable to find a tool (designed for 16S data) that allows me to control for visit 1 relative abundance of taxa.
Specific questions:
1) Are there microbiome-specific tools that exist that allow users to control for baseline (i.e., visit 1) relative abundance of taxa with longitudinal data? I imagine that the tool would essentially run tests like this for each taxa: outcome ~ v1_outcome + v1_taxa + visit*V1V2_taxa_change
.
2) If there are not, are there any statistical approaches that will allow me to get similar answers, even if it is not modeled exactly the same way? (I suppose that I could just run a ton of mixed effects models and apply a correction for multiple testing afterward, but this seems unlikely to be a robust approach for microbiome data, based on what I've read, along with other caveats).
Thank you for your time and help!
Jeremy
Edit: For future people with this same question, because of the statistics focus of this question, I also posted it here in case the folks who can help are over there. :)