I'm trying to learn how to conduct RNA-seq differential expression analysis. I used data from this site, specifically the dataset on the mouse mammary gland. Samples were collected from two cell types (basal and luminal) from mice with different "sexual experience" (virgin, lactate, and pregnant), and counts were obtained for each gene in each sample.
I loaded the data into R, built a DESeq2 object, and specified a design formula:
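library(DESeq2)
info <- read.csv('./SampleInfo.txt', sep='\t')
data <- read.csv('./GSE60450_LactationGenewiseCounts.txt', sep='\t')
# Keep only the count columns; this assumes the first two columns are gene ID and gene length
countdata <- data[, -(1:2)]
rownames(countdata) <- data[, 1]
# relevel() needs factors; read.csv() returns character columns in R >= 4.0
info$Status <- relevel(factor(info$Status), ref='virgin')
info$CellType <- relevel(factor(info$CellType), ref='basal')
SampleTable <- data.frame(sex_exp=info$Status, cell=info$CellType)
data_dseq <- DESeqDataSetFromMatrix(countData = countdata, colData = SampleTable,
design = ~ cell + sex_exp + cell:sex_exp)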
Then I ran the function that estimates differential expression (DE):
data_DE <- DESeq(data_dseq)
The main question arose when I tried to get specific results about DE. If we call this function
resultsNames(data_DE)
we get
[1] "Intercept" "cell_luminal_vs_basal"
[3] "sex_exp_lactate_vs_virgin" "sex_exp_pregnant_vs_virgin"
[5] "cellluminal.sex_explactate" "cellluminal.sex_exppregnant"
With these names we can extract the log2 fold change (logFC) and the corresponding adjusted p-value (p-adj) for each gene in the dataset using results(). As far as I understand, these logFC and p-adj values come from comparisons between specific groups, which are defined by the formula.
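To make concrete what I mean by calling results() with one of these names, here is a minimal sketch. It runs on simulated counts from DESeq2's makeExampleDESeqDataSet() rather than on my data, and the object names (`dds`, `res`) and the way samples are assigned to groups are assumptions made only for illustration.

```r
library(DESeq2)

set.seed(1)
# Simulated counts (NOT the GSE60450 data): 200 genes, 12 samples
dds <- makeExampleDESeqDataSet(n = 200, m = 12)

# Hypothetical group labels mimicking my design: 2 cell types x 3 statuses, 2 samples each
dds$cell <- relevel(factor(rep(c('basal', 'luminal'), each = 6)), ref = 'basal')
dds$sex_exp <- relevel(factor(rep(c('virgin', 'pregnant', 'lactate'), times = 4)),
                       ref = 'virgin')

design(dds) <- ~ cell + sex_exp + cell:sex_exp
dds <- DESeq(dds)

# List the available coefficient names, then pull one of them by name
resultsNames(dds)
res <- results(dds, name = 'cell_luminal_vs_basal')
head(res[, c('log2FoldChange', 'padj')])
```

On the real object this would be the same call with data_DE in place of dds.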
Could you tell me whether I am interpreting these names correctly? I still have some gaps, and I'd appreciate help filling them :)
- "Intercept". Because the reference levels of the factors are 'virgin' and 'basal', this term should contain information about gene expression in basal cells of virgin mice. But I'm not sure which samples DESeq2 compares this group to, or what logFC means here.
- "cell_luminal_vs_basal". Here 'virgin' stays unchanged and 'basal' changes to 'luminal'. So logFC describes the difference in gene expression between luminal and basal cells in virgin mice.
- "sex_exp_lactate_vs_virgin" "sex_exp_pregnant_vs_virgin". These two are similar to the previous one, but 'basal' stays unchanged and 'virgin' changes to 'lactate' and 'pregnant' respectively. So in the first case logFC is the difference in gene expression in basal cells of lactating vs virgin mice (pregnant vs virgin in the second).
- "cellluminal.sex_explactate" "cellluminal.sex_exppregnant". I'm not sure, but it seems to me that logFC describes changes in gene expression in luminal cells between lactating and virgin mice in the first case, and between pregnant and virgin mice in the second.
Could you please check these definitions? Am I right?
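To see which groups each of these names touches, the design matrix behind the formula can be printed. This is a toy sketch in base R with one made-up row per group, not my real sample table; the names `groups` and `mm` are just for illustration.

```r
# One hypothetical row per cell type x status combination (6 groups)
groups <- expand.grid(
  sex_exp = factor(c('virgin', 'lactate', 'pregnant'),
                   levels = c('virgin', 'lactate', 'pregnant')),
  cell = factor(c('basal', 'luminal'), levels = c('basal', 'luminal'))
)

# Same formula as in my DESeq2 design
mm <- model.matrix(~ cell + sex_exp + cell:sex_exp, data = groups)
rownames(mm) <- paste(groups$cell, groups$sex_exp, sep = '_')

# Each column corresponds to one coefficient; a 1 means that coefficient
# contributes to the fitted value of the group in that row
mm
```

In this matrix the basal_virgin row has a 1 only in the intercept column, and each interaction column has a 1 only in the row of the matching luminal group.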
I derived these statements from regression analysis. In R, the output of lm() includes an Intercept (when we have categorical and numeric predictors), which is the predicted value of the response when all numeric predictors are zero and all categorical predictors are at their first level. I tried to extrapolate this to the DESeq2 output, and I'm not sure how reliable that is.
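Here is the kind of lm() example I have in mind, as a small sketch on made-up numbers (the `toy` data frame and its column names are invented for illustration). The response is constructed to be exactly linear, so the intercept should come out at about 5, the value for the first level ('virgin') at dose 0.

```r
# Made-up data: one categorical predictor (group) and one numeric predictor (dose)
toy <- data.frame(
  group = factor(rep(c('virgin', 'lactate'), each = 3),
                 levels = c('virgin', 'lactate')),  # 'virgin' is the first level
  dose  = c(0, 1, 2, 0, 1, 2),
  y     = c(5, 7, 9, 8, 10, 12)
)

fit <- lm(y ~ group + dose, data = toy)
coef(fit)

# Prediction for the first level of the factor with the numeric predictor at zero
ref_pred <- predict(
  fit,
  newdata = data.frame(group = factor('virgin', levels = levels(toy$group)),
                       dose = 0)
)
ref_pred
```

The prediction for the reference group at dose 0 equals the intercept, which is the analogy I am trying to carry over to the "Intercept" name in DESeq2.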
Thanks for your help and time :)
"As you wrote, the intercept refers the gene expression in the samples corresponding to the reference levels of the factors (virgin & basal in your case). It is not compared to other samples, but compared to 0 (no expression). Log2FC is simply the log2 transformation of the baseMean expression. The pvalue reflects how statistically significant is the expression of the gene compared to 0." Can we use the
intercept to filter for genes that can be called specific to that condition (in this case, basal)? Would that be a statistically correct approach?