Hey everybody,
I have a question about the Surrogate Variables Analysis(SVA) implementation. I have an RNA-seq dataset counts matrix that I am piping into DESeq2 to identify differentially expressed genes. When the RNA libraries were prepared the cells were grown in batches with one wild-type strain per batch. I have included this variable into the DESeq2 ~Design as Sample_Match. To essentially explore and see if there were any other batch effects I implemented svaseq on the counts matrix and included 2 of the significant surrogate variables found to see if the results changed.
My assumption was that the surrogate variables found by svaseq would capture variations in my dataset not explained by biological variations, and thus the number of differentially expressed genes would increase. When I add two surrogate variables to the DESeq2 analysis two of the three genotypes have more differentially expressed genes while the third had less (~200 of the original ~1800 genes).
In this case is svaseq capturing biological variation? And am I implementing it correctly?
Relevant code below:
# Dependencies
require(Rsamtools)
require(DESeq2)
require(GenomicFeatures)
require(GenomicAlignments)
require(genefilter)
require(pheatmap)
require(RColorBrewer)
require(ggplot2)
require(sva)
require(VennDiagram)
# Import & pre-process Count Data for DESeq2----------------------------------------------------------------------------------------
# Load sample information csv
csvfile <- file.path("sample_Info.csv")
sampleTable <- read.csv(csvfile) # read csv
# Load count data (from featureCounts) and convert to matrix
countdata <- read.table("counts.txt", header=TRUE, row.names=1)
countdata <- as.matrix(countdata)
# SVA: Unsupervised calculation of surrogate variables for batch correction------------------------------------------------------
# get counts from counts file
# keep only genes with > 1 counts for all samples
countdata_filtered <- countdata[rowMeans(countdata) > 1, ]
# svaseq
mod1 <- model.matrix(~ sampleTable$Sample_Match + sampleTable$Strain) # full matrix
mod0 <- model.matrix(~ sampleTable$Sample_Match) # null matrix
# calculate number of surrogate variables (6 were found)
num_sv <- num.sv(countdata_filtered, mod1, method ="leek")
print(c("Calculated Number of Significant Surrogate Variables = ", num_sv))
svseq <- svaseq(countdata_filtered, mod1, mod0, n.sv = 2)
# DE Analysis with DESeq-------------------------------------------------------------------------------------------
coldata <- data.frame(row.names=colnames(countdata), svseq$sv[,1], svseq$sv[,2],
factor(sampleTable$Sample_Match), sampleTable$Strain)
colnames(coldata) <- c("unsupervised_sv1","unsupervised_sv2", "Sample_Match", "Strain")
dds <- DESeqDataSetFromMatrix(countData=countdata, colData=coldata,
design= ~ unsupervised_sv1 + unsupervised_sv2
+ Sample_Match + Strain)
dds <- DESeq(dds)
# Generating Contrast and Results Tables from DESeq2-----------------------------------------------------------------------
# Removes any gene in the count matrix with 0 or 1 counts
dds <- dds[ rowSums(counts(dds)) > 1, ]
## ΔRSP1 vs wt ##
res_P1 <- results(dds, contrast=c("Strain","ΔRSP1","wt"))
#summary(res_P1,alpha=0.05)
## ΔRDF2 vs wt ##
res_F2 <- results(dds, contrast=c("Strain","ΔRDF2","wt"))
table(res_F2$padj < 0.05) #total genes with padj<0.05
## ΔRDN2 vs wt ##
res_N2 <- results(dds, contrast=c("Strain","ΔRDN2","wt"))
table(res_N2$padj < 0.05) #total genes with padj<0.05
Here is the sample information from the csv file:
Sample,Strain,Sample_Match
F21,ΔRDF2,1
F22,ΔRDF2,2
F23,ΔRDF2,3
N22,ΔRDN2,4
N23,ΔRDN2,5
P11,ΔRSP1,1
P12,ΔRSP1,2
P13,ΔRSP1,3
SB1,wt,1
SB2,wt,2
SB3,wt,3
SB4,wt,4
SB5,wt,5