Odd DESeq2 results
1
0
Entering edit mode
4.4 years ago
mbk0asis ▴ 700

Hi, all!

During differential expression analysis using DESeq2, I found odd results from it; some of genes with many missing data were detected as significant DEGs with very low pvalues as follow.

            log2FoldChange  pvalue  padj    Sample1 Sample2 Sample3 Sample4 Sample5 Sample6 Control1    Control2    Control3    Control4    Control5
Actc1   25.34   7.34E-72    4.45E-68    0.00    0.00    0.00    0.00    0.00    796.48  0.00    0.00    0.00    0.00    0.00    132.75  0.00

Sorry, it's not so easy to see hear, but you can see that only one sample out of 11 samples + controls actually has a count values which is unlikely detectable as a significant DEG. I observed many genes in this condition were detected as significant DEGs.

The code to obtain these results is like below,

conds <- c(rep("Sample",6),rep("Control",5))
res <- results(dds, contrast = c("conds", Sample, Control))

How come these genes have so low pvalues and padj so they are detected as DEGs?

Can somebody explain to me?

Thank you!

RNA-Seq DESeq2 • 1.7k views
ADD COMMENT
0
Entering edit mode

Can you show the output of counts(dds[grep("Actc1"), rownames(dds),])?

ADD REPLY
0
Entering edit mode

Thank you, ATpoint! But I got error with your command.

counts(dds[grep("Actc1"), rownames(dds),])
Error in grep("Actc1") : argument "x" is missing, with no default

I found a typo in the code,

counts(dds[grep("Actc1", rownames(dds)),])
    Sample1 Sample2 Sample3 Sample4 Sample5 Sample6 Control1    Control2    Control3    Control4    Control5
Actc1   0   0   0   0   0   778 0   0   0   0   0
ADD REPLY
0
Entering edit mode

Sorry for the typo. That is odd, only one sample here has counts whereas in your toplevel example there are also counts for control 4 so it does not really match. Are you sure that there is no parsing error and this p-value and fold change do not belong to a different gene?

ADD REPLY
0
Entering edit mode

Actual data have more samples in complex conditions, I might have made a mistake copying and pasting the data.

Sorry!

For clarity, I'm posting the actual values in my DESeq2 data. I still see the strangely too low pvalues.

> res[grep("Actc1", rownames(res)),]
log2 fold change (MLE): conds WT_PBMC_24m vs WT_PBMC_2m 
Wald test p-value: conds WT_PBMC_24m vs WT_PBMC_2m 
DataFrame with 1 row and 6 columns
              baseMean   log2FoldChange            lfcSE             stat               pvalue                 padj
             <numeric>        <numeric>        <numeric>        <numeric>            <numeric>            <numeric>
Actc1 4434.62314133305 25.3445537653263 1.41381458753046 17.9263631800519 7.34284592381757e-72 4.44584844534075e-68

> dds2 <- dds[,c(79:84,90:94)]
> counts(dds2[grep("Actc1", rownames(dds2)),])
      WT_PBMC_24m_1 WT_PBMC_24m_2 WT_PBMC_24m_3 WT_PBMC_24m_4 WT_PBMC_24m_5 WT_PBMC_24m_6 WT_PBMC_2m_1 WT_PBMC_2m_2 WT_PBMC_2m_3 WT_PBMC_2m_4 WT_PBMC_2m_5
Actc1             0             0             0             0             0           778            0            0            0            0            0

> cntNorm2 <- cntNorm[,c(79:84,90:94)]
> cntNorm2[grep("Actc1", rownames(cntNorm2)),]
WT_PBMC_24m_1 WT_PBMC_24m_2 WT_PBMC_24m_3 WT_PBMC_24m_4 WT_PBMC_24m_5 WT_PBMC_24m_6  WT_PBMC_2m_1  WT_PBMC_2m_2  WT_PBMC_2m_3  WT_PBMC_2m_4  WT_PBMC_2m_5 
        0.000         0.000         0.000         0.000         0.000       796.481         0.000         0.000         0.000         0.000         0.000

Thank you!

ADD REPLY
0
Entering edit mode

Without full code and a complete description of the experiment (at least the colData) I cannot comment any further.

ADD REPLY
0
Entering edit mode

OK. Thank you for spending time for me!

I'll check the codes and data again to see if there's any mistakes.

ADD REPLY
0
Entering edit mode

After carefully checking the data, I found that I got different normalized count values depending on the samples.

I have 36 samples in 5 groups as follow (for simplicity, I changed the sample names to more easily distinguish them),

HD_old_1
HD_old_2
HD_old_3
HD_old_4
HD_old_5
HD_old_6
HD_old_7
HD_old_8
HD_old_9
HD_old_10
HD_young_1
HD_young_2
HD_young_3
HD_young_4
HD_young_5
HD_young_6
HD_young_7
HD_young_8
HD_young_9
HD_young_10
WT_old1_1
WT_old1_2
WT_old1_3
WT_old1_4
WT_old1_5
WT_old1_6
WT_old2_1
WT_old2_2
WT_old2_3
WT_old2_4
WT_old2_5
WT_young_1
WT_young_2
WT_young_3
WT_young_4
WT_young_5

My goal is finding DEGs between WT_old1 and WT_young. When I extracted WT_old1 and WT_young samples and ran DESeq2 only or when all samples were input together, the normalized count values of Actc1 were different.

HD_old_1    HD_old_2    HD_old_3    HD_old_4    HD_old_5    HD_old_6    HD_old_7    HD_old_8    HD_old_9    HD_old_10   HD_young_1  HD_young_2  HD_young_3  HD_young_4  HD_young_5  HD_young_6  HD_young_7  HD_young_8  HD_young_9  HD_young_10 WT_old1_1   WT_old1_2   WT_old1_3   WT_old1_4   WT_old1_5   WT_old1_6   WT_old2_1   WT_old2_2   WT_old2_3   WT_old2_4   WT_old2_5   WT_young_1  WT_young_2  WT_young_3  WT_young_4  WT_young_5
all_raw 21  15  19  15  5   10  22  14  14  15  15  15  0   1   6   0   11  6   1   1   0   0   0   0   0   778 0   0   0   0   0   0   0   0   0   0
all_cntNorm 12.0    23.1    19.0    44.4    13.3    8.1 18.7    22.7    86.8    30.5    27.3    30.7    0.0 10.9    5.5 0.0 161.9   9.6 1.2 1.3 0.0 0.0 0.0 0.0 0.0 536.7   0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
wt_raw  -   -   -   -   -   -   -   -   -   -   -   -   -   -   -   -   -   -   -   -   0   0   0   0   0   778 0   0   0   0   0   0   0   0   0   0
wt_cntNorm  -   -   -   -   -   -   -   -   -   -   -   -   -   -   -   -   -   -   -   -   0.0 0.0 0.0 0.0 0.0 1128.0  0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

I'm not sure if you can see the data above, but the raw counts are the same, while the normalized counts are different.

Interestingly, however, the fold-change values are the same (only p-values were different).

Having different normalized count values depending on the input samples is understandable, but why am I getting so weird p-values?

ADD REPLY
3
Entering edit mode
4.4 years ago

Hi,

There are two things that you can do to correct or minimize what you've obtained:

(1) use the lfcShrink() (from the DESeq2) (have a look into the function doc: https://rdrr.io/bioc/DESeq2/man/lfcShrink.html) to shrink the log-fold changes of genes that are highly variable among replicates or have low read counts and, therefore, the estimates are more prone to uncertainty and error. This function is applied after you get the differential expression results. It will not change the p-value, but will decrease the log-fold change towards zero (the official DESeq2 vignette provide some examples);

(2) you can remove zeros and singletons (genes expressed only once) from your data set, as this may represent sequencing artifacts, contamination etc.

I hope this helps,

António

ADD COMMENT

Login before adding your answer.

Traffic: 1980 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6