Does anyone have experience doing hierarchical regression with an RNA-seq dataset? I was hoping to add a series of dummy variables encoding the five levels of my batch effect to determine which batch if any is having a significant effect
mod.b2 <- model.matrix(~ factDx + batch2)
mod.b2b5 <- model.matrix(~ factDx + batch2 + batch5)
mod.b2b5b4 <- model.matrix(~ factDx + batch2 + batch5 + batch4)
mod.b2b5b4b1 <- model.matrix(~ factDx + batch2 + batch5 + batch4 +batch1)
Etc...
Its pretty easy to set up the models but I am unsure of how to set up/compare the Rsquared values of those models for each individual gene. My thinking was that I could create a list of Rsquared values and then look at the change in Rsquared resulting from the addition of a variable to identify whether the additional variable was having a significant effect on few/many genes and which genes specifically were affected by which batch level. I have not found any pipeline or example attempting this though, and was wondering if anyone else had such an example they could point me to.
It's probably easier and more informative to start out with a PCA plot. See the PCAtools vignette. Check if any of the PCs are segregating your samples by batch.
You make it far too difficult to yourself, this goes for this post and all the others on the basically same topic before. You seem to have a normal RNA-seq dataset, therefore simply follow established procedures rather than coming up with very custom solutions. Perform a PCA or MDS, see whether you have evidence for the batch effect, and then include it into the design given it is not confounded with factDx. That's it, from there on simply follow edgeR manual (which also covers batch effect by the way). Agreed with rpolicastro on PCAtools, that is probably the simplest solution. You can also check for a simple analysis: https://www.biostars.org/p/461026/ which covers both simple batch diagnostics with PCAtools and a DEG analysis via QLF framework with edgeR.