I have my RNA-Seq counts and the various factors of my experimental setup (condition, strain, pool, etc.). I would like to know which of these effects matter, and how much, so that I can decide how to model them in a GLM.
I have found this package called PVCA, which seems to show the proportion of variance explained by each factor and each interaction of factors.
If counts is my count table and met is my metadata table, I use:
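library(pvca)
library(Biobase)  # provides ExpressionSet and AnnotatedDataFrame
# Bundle the count matrix and the sample metadata into an ExpressionSet
eset <- ExpressionSet(as.matrix(counts), new("AnnotatedDataFrame", data=met))
# Run PVCA on the listed factors
pvcaobj <- pvcaBatchAssess(eset, batch.factors=c("bias","diet","line"), threshold=0.6)
# Collect the labels and the rounded variance proportions into one table
df <- data.frame(label=as.character(pvcaobj$label), wmpv=round(as.numeric(pvcaobj$dat),2))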
And this returns something like this:
label wmpv
1 diet:line 0.04
2 bias:line 0.02
3 line 0.02
4 bias:diet 0.02
5 bias 0.02
6 diet 0.02
7 resid 0.86
So here are my questions.
Which dataset should I use as counts? They all produce different results.
- raw filtered counts
- cpm transformed counts
- cpm log transformed counts
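To be concrete about what I mean by these three inputs, here is a minimal base-R sketch. The toy matrix, the filtering cutoff of 10 total reads, and the pseudocount of 1 are all assumptions made for illustration; they are not my real data or settings. The names toy_counts, raw_filtered, cpm_counts and log_cpm are hypothetical.

```r
# Toy count matrix: 6 genes x 4 samples (made-up numbers, illustration only)
toy_counts <- matrix(c(10, 0, 250, 3, 40, 1,
                       12, 1, 300, 0, 35, 0,
                        8, 0, 220, 5, 50, 2,
                       15, 2, 310, 1, 45, 0),
                     nrow = 6,
                     dimnames = list(paste0("gene", 1:6), paste0("s", 1:4)))

# 1. Raw filtered counts: drop genes with fewer than 10 reads in total
#    (the cutoff of 10 is an arbitrary assumption)
keep <- rowSums(toy_counts) >= 10
raw_filtered <- toy_counts[keep, , drop = FALSE]

# 2. CPM: divide each sample (column) by its library size, scale to one million
lib_size <- colSums(raw_filtered)
cpm_counts <- t(t(raw_filtered) / lib_size) * 1e6

# 3. log2 CPM: add a pseudocount of 1 (assumption) to avoid log2(0)
log_cpm <- log2(cpm_counts + 1)
```

Any of raw_filtered, cpm_counts or log_cpm could then be passed to ExpressionSet() in place of counts. In practice I would probably compute the CPM values with edgeR::cpm() (with log=TRUE for the log version) rather than by hand.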
What does the threshold=0.6 argument in pvcaBatchAssess() do?
Are there any other tools or methods to assess batch effects?