DESeq2 error: NA values are not allowed in the count matrix
3.3 years ago
Nai ▴ 50

I have a count matrix from featureCounts, with gene names as row names and one column per sample. When I call DESeqDataSetFromMatrix with the following commands, I get an error:

count <- read.csv("matrix.count", header = TRUE, row.names = 1, sep ='\t')
info <- read.table("coldata2.txt", header = TRUE, sep ='\t')
library(DESeq2)
library(apeglm)
ds <- DESeqDataSetFromMatrix(count, info, ~condition)

Error: DESeqDataSet(se, design = design, ignoreRank) : NA values are not allowed in the count matrix.

any(is.na(count))
TRUE
all(is.numeric(count))
FALSE

I tried the following too:

ds <- DESeqDataSetFromMatrix(count, info, ~condition, tidy=TRUE)
ds <- DESeqDataSetFromMatrix(count, info, ~condition, tidy = FALSE)
Then I got repeated IDs in row.names.

Could you please suggest a solution?

featurecount NA DESeq2 values • 2.6k views
ADD COMMENT

You should figure out why there are NA values in your count matrix. You can figure out which genes have NA values using the following code.

# Making some example data.
mat <- replicate(10, sample(c(NA, 1:50), 10))

> mat
      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
 [1,]   16   13   24   33   33   24   27   34   19    27
 [2,]   43    4   31   29   41   27   36   25   10     1
 [3,]    8    8   16   31   22   42    9   30   16    30
 [4,]   20   15   45    5   11   13    4   44    8     7
 [5,]   42   25   13   21   25   33   34   42   48    15
 [6,]    7   42    5   37   40    1   13   31   12     2
 [7,]   28   41   42   30   NA   30   45   16   30    36
 [8,]   37   44   36   47    1   31   50   26   NA    12
 [9,]   NA   32   15   19   23   26   32   29   46    11
[10,]   12   28   30   14   50    9   49   35   32    NA

# Filtering for rows with NA values.
mat[rowSums(is.na(mat)) > 0, ]
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,]   28   41   42   30   NA   30   45   16   30    36
[2,]   37   44   36   47    1   31   50   26   NA    12
[3,]   NA   32   15   19   23   26   32   29   46    11
[4,]   12   28   30   14   50    9   49   35   32    NA

Once you find rows with NA values, get back to us with what you find. It's also important to go through the methods that generated the count matrix to find out where NA could have been introduced.
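
A related check: the count object in the question comes from read.csv(), so it is a data frame, and is.numeric() on a whole data frame always returns FALSE. The types and NA counts are easier to inspect column by column. Below is a minimal sketch on made-up data; the toy data frame, its values, and its column names are assumptions for illustration, not the poster's file.

```r
# Toy data frame standing in for the real count table (hypothetical values).
count <- data.frame(
  sampleA = c(10, NA, 3),
  sampleB = c("5", "7", "x"),   # a character column, e.g. from a stray text field
  row.names = c("gene1", "gene2", "gene3"),
  stringsAsFactors = FALSE
)

# is.numeric() on a whole data frame is always FALSE; check column by column.
col_is_numeric <- sapply(count, is.numeric)
print(col_is_numeric)

# Number of NA values in each column, and the rows that contain any NA.
na_per_column <- colSums(is.na(count))
print(na_per_column)
rows_with_na <- rownames(count)[rowSums(is.na(count)) > 0]
print(rows_with_na)
```

A non-numeric column or a column full of NA usually points to an annotation column (or a header mismatch) that was read in as if it were a sample.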

ADD REPLY

When I look at the data there are no NA values, only a lot of 0s. NA appears only in the gene names. I replaced it, but I still get this error.

ADD REPLY

Hi, I ran count[rowSums(is.na(count)) > 0, ] and it showed only the column names.

ADD REPLY

The problem, as you have inferred, is this:

NA values are not allowed in the count matrix

Please try to understand why there would be NA values in your count data. You may have to review the upstream data processing steps.

If the NA values just need to be 0, then you can impute these via:

count[is.na(count)] <- 0

For further help, please show the output of:

str(count)
str(info)

PS - I am not sure that you need to load apeglm
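
If replacing NA with 0 is appropriate, the steps could look roughly like the sketch below. It uses a made-up table (the values and sample names are assumptions), imputes the missing entries, and converts the result to an integer matrix, since DESeq2 expects non-negative integer counts. It also checks for duplicated row names, which the question mentions.

```r
# Toy count table with one missing value (hypothetical data).
count <- data.frame(
  sampleA = c(10, NA, 3),
  sampleB = c(5, 7, 0),
  row.names = c("gene1", "gene2", "gene3")
)

# Replace missing counts with 0, only if NA really means "no reads".
count[is.na(count)] <- 0

# Convert to an integer matrix, the form expected for a count matrix.
count_mat <- as.matrix(count)
storage.mode(count_mat) <- "integer"

# Sanity checks before building the DESeq2 object.
stopifnot(
  !anyNA(count_mat),
  is.integer(count_mat),
  !anyDuplicated(rownames(count_mat))
)
str(count_mat)
```

If the last check fails on real data, the gene names are not unique and need to be resolved before they can be used as row names.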

ADD REPLY

Dear All,

Thank you for your support. I have completed the DEG analysis by following each step as you guided. I would like to get the expression value for each sample, but my output only has baseMean, log2FoldChange, lfcSE, stat and padj. I would also like to know how to reduce the number of genes. The commands:

> cts <- read.csv("final_file", sep = "\t", header = TRUE, row.names = 1)
> View(cts)
> dds_new <- DESeqDataSetFromMatrix(cts, info, ~Condition)
Warning message:
In DESeqDataSet(se, design = design, ignoreRank) :
  some variables in design formula are characters, converting to factors
> View(dds_new)
> keep <-rowSums(counts(dds_new)) >= 10
> dds_new <- dds_new[keep,]
> ddsDE_new <- DESeq(dds_new)
estimating size factors
estimating dispersions
gene-wise dispersion estimates
mean-dispersion relationship
final dispersion estimates
fitting model and testing
-- replacing outliers and refitting for 1617 genes
-- DESeq argument 'minReplicatesForReplace' = 7
-- original counts are preserved in counts(dds)
estimating dispersions
fitting model and testing
> normCounts <- counts(ddsDE_new, normalized = TRUE)
> View(normCounts)
> write.csv(normCounts, "cancer_normal_Poutanen.csv")
> results(ddsDE_new, alpha = 0.05)
> res_ddsDE_new <- results(ddsDE_new, alpha = 0.05)
> summary(res_ddsDE_new)

Then I got 36,000 genes without any per-sample information.

ADD REPLY

Which per sample information do you want? If the normalised counts, these are accessible via:

counts(ddsDE_new, normalized = TRUE)
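
For context on what "normalised" means here: DESeq2's default normalisation is the median-of-ratios method, in which each sample's counts are divided by a per-sample size factor. The sketch below reproduces the idea in base R on made-up counts. It is a simplified illustration, not the package's own code, and it assumes no gene has a zero count (genes with zeros need special handling that is left out here).

```r
# Toy raw counts: 4 genes x 3 samples (hypothetical values, no zeros).
raw <- matrix(
  c(10, 20, 40,
    100, 200, 400,
    5, 10, 20,
    50, 100, 200),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("gene", 1:4), c("s1", "s2", "s3"))
)

# Per-gene geometric mean across samples, kept on the log scale.
log_geo_mean <- rowMeans(log(raw))

# Size factor per sample: median over genes of count / geometric mean.
size_factors <- apply(raw, 2, function(col) exp(median(log(col) - log_geo_mean)))

# Normalized counts: each column divided by its size factor.
norm_counts <- sweep(raw, 2, size_factors, "/")
print(size_factors)
print(norm_counts)
```

In this toy example the three samples differ only by sequencing depth, so after normalisation every gene has the same value in all samples.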
ADD REPLY

OK, thank you. For these I got 36,000 genes. I want to filter for significant genes. I would also like to know the meaning of alpha = 0.05. How can I filter the normalized per-sample values down to those genes?

ADD REPLY

Hi, please take a look at the subset() function, e.g.:

subset(res_ddsDE_new, padj < 0.05 & abs(log2FoldChange) > 1)

Then, take the list of genes from the output of the above command (they may be set as rownames()), and use these to further subset the normalised counts.

It helps if you share some sample data; otherwise, we can only hypothesise about what your data looks like.
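
The two steps described above can be sketched as follows. Because no sample data was shared, this uses a plain data frame as a stand-in for the results table and a small matrix as a stand-in for the normalised counts; all gene names, sample names, and values are made up.

```r
# Toy stand-in for a results table (hypothetical values; a plain data frame,
# not a real DESeq2 results object).
res <- data.frame(
  log2FoldChange = c(-1.7, 0.2, 2.3, 1.5),
  padj = c(0.004, 0.60, 0.01, NA),
  row.names = c("geneA", "geneB", "geneC", "geneD")
)

# Toy normalized counts with the same gene names as rows.
norm_counts <- matrix(
  c(80, 85, 20, 25,
    10, 12, 11, 9,
    5, 6, 30, 28,
    1, 0, 3, 2),
  nrow = 4, byrow = TRUE,
  dimnames = list(rownames(res), c("ctrl1", "ctrl2", "case1", "case2"))
)

# subset() drops rows where the condition is NA (e.g. padj not computed).
sig <- subset(res, padj < 0.05 & abs(log2FoldChange) > 1)
sig_genes <- rownames(sig)

# Keep only the significant genes in the per-sample matrix.
sig_counts <- norm_counts[sig_genes, , drop = FALSE]
print(sig_counts)
```

The drop = FALSE argument keeps the result as a matrix even when only one gene passes the filter.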

ADD REPLY

res_ddsDE_new has 36,000 rows. This is how res_ddsDE_new looks:

             baseMean log2FoldChange
            <numeric>      <numeric>
DDX11L1      1.779144     -1.4955939
WASH7P     152.518293     -0.0505911
MIR6859-1   20.653876      0.5689275
MIR1302-2HG  0.255387     -1.9691031
FAM138A      0.353478      0.1574042

When I use subset(res_ddsDE_new, padj < 0.05 & abs(log2FoldChange) > 1), I get only 1 row:

      baseMean log2FoldChange    lfcSE     stat      pvalue      padj
ZWINT  82.3486       -1.74934 0.330334 -5.29568 1.18575e-07 0.0043631

ADD REPLY

I want the significant genes per sample. In the output of counts() with normalized = TRUE, every gene has expression values but there is no p-value. How can I shortlist them?

ADD REPLY

Filtering with subset() on padj and log2FoldChange did not give me significant results. How can I get the TPM values for group 1 and group 2 for each gene?
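
On the TPM question: TPM is not among the outputs shown in this thread, and computing it needs gene lengths in addition to raw counts. The sketch below shows one way it might be done in base R, assuming gene lengths are available (featureCounts output normally includes a Length column). The counts, lengths, sample names, and group labels here are all made up for illustration.

```r
# Toy raw counts and gene lengths in base pairs (hypothetical values).
raw <- matrix(
  c(100, 200, 300, 400,
    50, 50, 100, 100),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("geneA", "geneB"), c("n1", "n2", "c1", "c2"))
)
gene_length <- c(geneA = 2000, geneB = 1000)
group <- factor(c("normal", "normal", "cancer", "cancer"))

# Reads per kilobase of gene, then scale each sample to sum to one million.
rpk <- raw / (gene_length / 1000)
tpm <- sweep(rpk, 2, colSums(rpk), "/") * 1e6

# Mean TPM per group for every gene (one column per group level).
group_mean_tpm <- sapply(levels(group), function(g) {
  rowMeans(tpm[, group == g, drop = FALSE])
})
print(group_mean_tpm)
```

Group means of TPM are a descriptive summary only; the significance calls still come from the DESeq2 results table.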

ADD REPLY
