Hello. I'm currently studying RNA-seq on my own and have a part that I don't understand. Please let me know if I've violated any rules of this forum, and I'll take appropriate action. Since I lived in a non-English speaking country, please understand that my English may not be fluent.
First, I'm studying plants (rice). I obtained RNA-sequencing datas for a total of six samples, with three biological replicates each for Wild-type and mutant plant. I used Kallisto for mapping and quantification. After quantification, I obtained abundance.h5 and abundance.tsv files for each sample.
To extract the list of Differentially Expressed Genes (DEGs), I used the DESeq2 function in R. I used the abundance.h5 file for analysis. However, I'm not clear on the concept of normalization (I understand what normalization is and why it is necessary). I've read the DESeq() manual. I understand that DESeq() requires raw data, and since it contains an internal normalization model, I don't need to perform additional normalization.. But I'm unsure if that's true.
Do I need to perform additional normalization when using DESeq2? The tsv file contains both TPM and est_counts. If using abundance.h5 data, which data type should be input into DESeq(): TPM or est_counts?
Here is the R code I have used. Please let me know if there are any mistakes. What I aim to obtain is the list of DEGs (Differentially Expressed Genes) between the wild-type (WT) and mutant samples.
##
info <- read.csv('metadata.txt', sep='\t')
files <- paste('./s1', list.files(path='./s1', pattern = 'abundance.h5', recursive = TRUE), sep='/')
names(files) <- info$Sample.Name
tx2gene <- read.csv('ref.cdna.csv')
kallisto.h5 <- tximport::tximport(files, type = "kallisto", tx2gene = tx2gene, ignoreAfterBar = TRUE)
mr = mr %>% mutate(Sample.Name = as.factor(Sample.Name))
dds <- DESeqDataSetFromTximport(kallisto.h5, info, ~Sample.Name)
dds$Sample.Name <- relevel(dds$Sample.Name, ref = 'WT')
#DESeq2
dds <- DESeq(dds)
res <- results(dds)
S1 <- results(dds, name="Sample.Name_MT_vs_WT")
S1d <- as.data.frame(S1)
S1f0.05 <- S1d[which(S1d$padj < 0.05 & abs(S1d$log2FoldChange) >= 2),]
write.csv(S1f0.05,file="deseqS1.csv") #This is the file I need.
#Normalization #How do I utilize this file?
S1_normalized<-counts(dds, normalized=TRUE)
write.table(S1_normalized, "S1_norm.txt", sep="\t")
Thank you for your response. I have often seen commands like dds <- estimateSizeFactors(dds) normalized_counts <- counts(dds, normalized=TRUE)
Why are these necessary?
Also, I've observed vsd_trans <- vst(dds, blind = TRUE) being used for PCA analysis and heatmap clustering.
Is additional normalization required for PCA analysis and heatmap clustering beyond the normalization model within DESeq2? I apologize for not being well-versed in RNA-seq. I'm studying it on my own, and I expect your response will be very helpful.
The
DESeq
command already does the necessary work that you'll need for your analysis (including normalization); it gets you your log2FoldChange and padj.If you want to do other types of analyses beyond what the
DESeq
command offers -- like PCA and heatmap clustering, you'd manually extract the normalized_counts and use those. vst is a special way of doing a log2-transform (you can read up on it). You generally want to do downstream analysis on counts that have been "normalized" (so you can compare between samples) and log-transformed (otherwise your heatmap will look like crap; you can look up why we log-transform things).Anyway, I hope that clears up everything -- do a search for additional questions :)