I'm using DESEq2 package for obtaining normalized data from raw read counts of RNA-Seq data. I'm not interested to do any downstream analysis like DEGs identification and just want the normalized and transformed expression matrix. In the steps, I also incorporated the following for any batch effect.
Mnemiopsis_count_data = read.table(file = "COUNT_DATA.txt", header = T, sep = "\t")
Mnemiopsis_col_data = read.table(file = "COL_DATA.txt", header = T, sep = "\t")
boxplot(Mnemiopsis_count_data)
hist(Mnemiopsis_count_data[,1]) # Plotting only the first sample (column 1)
pseudoCount = log2(Mnemiopsis_count_data + 1)
boxplot(pseudoCount)
hist(pseudoCount[,1])
dds = DESeqDataSetFromMatrix(countData = Mnemiopsis_count_data, colData = Mnemiopsis_col_data, design = ~ condition) # we're testing for the different condidtions
rld = rlogTransformation(dds)
par( mfrow = c( 1, 2 ) )
plot(log2( 1 + counts(dds)[ , 1:2] ), pch=16, cex=0.3, main = "log2")
plot(assay(rld)[ , 1:2],pch=16, cex=0.3, main = "rlog")
dds = estimateSizeFactors(dds)
sizeFactors(dds)
plotPCA(rld, intgroup=c("condition"))
plot(pseudoCount[,4], pseudoCount[,7], pch = 20, xlab = "Aboral 4", ylab = "Oral 3") + abline(0,1, col = "red")
vsd <- vst(dds)
assay(vsd) <- limma::removeBatchEffect(assay(vsd), vsd$batch)
plotPCA(vsd)
dds = DESeq(dds)
res <- results(dds)
mcols(res, use.names=TRUE)
summary(res)
norm_counts = counts(dds, normalized = TRUE) # Extract the normalized counts
I have some queries regarding batch correction:
1) Does the vsd object contains our normalized expression matrix and how can I save it in R? 2) If I'm using rlog instead of vst for transformation, then also do I need a batch correction? 3) Does batch correction changes our normalized and transformed expression matrix? If yes, what does it do? 4) Does the above script indicates the batch effects or does it corrects them also? Please share R codes if any modifications are required. 5) Also, do I have to increment the normalized data by +1 before saving?
Thanks
What is the content of
vsd$batch
? I cannot see that from the code. It should be a vector that indicates which sample belongs to which batch. This is somewhat related to:The script visualizes a potential batch via the PCA you should use, together with information from the wetlab such as the day of library prep etc. to check whether variation across datapoints and biological replicates could be due to a batch effect. The PCA after the batch correction with limma can then be used to check if this was successful. Is this your script?
Thank you sir for your prompt response. No this is not my script. Its an example script from DESEq2 page which I ran for understanding the concept. the vsd$batch is showing NULL. what does this mean?
It does not seem that you are following the DESeq2 workflow / vignette - can you confirm which one you are following? As an example, it makes little sense to create and use your
pseudoCount
object, as it is just logged raw counts + 1 (?).Which variable contains your batch information? - where is it stored?
For all intents and purposes, this is a simple workflow that you could use, assuming that
batch
is a variable in your colData:Thank you sir for your help. But how can i save expression data that has been batch corrected after running the last command: assay(vsd) <- limma::removeBatchEffect(assay(vsd), colData(dds)[['batch']])
Take a look at the
write.table()
function. The expression data itself will be stored inassay(vsd)
Sir after running this step
Is this step relevant?
This is getting a bit beyond the scope of biostars.
log2()
is a basic R function, please get more background before doing analysis in R, and most importantly, please use online tutorials such as the DESeq2 manual to learn how to do RNA-seq analyssis and which steps are necessary and why. The idea of this forum is not spoon-feeding.I completely understand the idea of the respective forum sir, but i'm a complete beginner in R and trying to dig in the concepts. Help from your end would be a great motivation for me.
i framed this code after reading:
I have applied vst transormation. Could you please tell me if the vaues in vsd are normalized or not? or do we have to perform normalization.
In your above code,
vsd
will contain normalised and transformed expression levels - yes.Thanks, Kevin. Before creating the DESeQ object i read my data like this.
where Col_data contains the sample information and Count_data contains expression data where the first column has the genes. I read the expression data without genes and assigned to variable CC. This was done because the samples in col data and count data must have the same dimensions
The vsd variable has normalized and log-transformed values without genes. Please help how can I get the genes with data also.