Dear all, I need your help for problem I faced in my analysis..
I combined multiple RNA seq datasets, adjusted batch effect by using ComBat Seq method and I wanted with batch adjusted counts to perform GSVA analysis followed by limma statistics. Problem I faced was GSVA output where I realized that I have significant results based on p value and p adjustment but my LFC is quite low going from 0.2 to -0.2 approx. This is output when I used counts as input but I also tried vst and log cpm (generated using EdgeR) data where I still got low LFC and In addition p adj was above 0.7.
Regarding data used as input I used instruction from vignette: https://www.bioconductor.org/packages/release/bioc/vignettes/GSVA/inst/doc/GSVA.html#3_Overview_of_the_GSVA_functionality where is written:
When using gsvaParam(), the user can additionally tune the following parameters, whose default values cover most of the use cases:
kcdf: The first step of the GSVA algorithm brings gene expression profiles to a common scale by calculating an expression statistic through the estimation of the CDF across samples. The way in which such an estimation is performed by GSVA is controlled by the kcdf parameter, which accepts the following four possible values: (1) "auto", the default value, lets GSVA automatically decide the estimation method; (2) "Gaussian", use a Gaussian kernel, suitable for continuous expression data, such as microarray fluorescent units in logarithmic scale and RNA-seq log-CPMs, log-RPKMs or log-TPMs units of expression; (2) "Poisson", use a Poisson kernel, suitable for integer counts, such as those derived from RNA-seq alignments; (3) "none", which will perform a direct estimation of the CDF without a kernel function.
I used intiger counts with Poisson kernel and Gaussian kernel for log-CPM data which I generated using code from EdgeR package cpm(count.table, log=TRUE). For VST data i used auto as default.
PS Data input included whole set of genes ~17 000 for 160 samples
Code for GSVA:
gsvaPar <- gsvaParam(as.matrix(count.table), re_gene_sets, minSize = 10, kcdf = "Poisson") # reactome
re.count <- as.data.frame(gsva(gsvaPar, verbose=FALSE))
gsvaPar <- gsvaParam(as.matrix(count.table.cpm), re_gene_sets, minSize = 10, kcdf = "Gaussian") # reactome
re.count <- as.data.frame(gsva(gsvaPar, verbose=FALSE))
gsvaPar <- gsvaParam(as.matrix(count.table.vst), re_gene_sets, minSize = 10) # reactome
re.count <- as.data.frame(gsva(gsvaPar, verbose=FALSE))
# The same was done for GOBP, Hallmark
RNR.gsva <- rbind(ha.count, re.count, bp.count)
data_matrix <- as.matrix(RNR.gsva)
library(limma)
# Fit linear model
fit <- lmFit(data_matrix, design)
# Compute moderated t-statistics using eBayes
fit <- eBayes(fit)
# Extract features associated with Response (R vs. NR)
gsva_RNR <- topTable(fit, coef = "ResponseR", number = Inf, adjust = "BH")