Question

DESeq2 throwing error while normalizing raw microarray expression data due to presence of negative values

0

Entering edit mode

22 months ago

accibio ▴ 20

I am trying to download and analyze a miRNA expression dataset from NCBI GEO (GSE25631). I specifically want non-normalized data to perform normalization and my own set of other analyses later on. Accordingly, I downloaded the Series Matrix File (to match the Sample IDs with their phenotype) and the non-normalized data GSE25631_non-normalized_data.txt.gz programmatically using the GEOquery package in R. To normalize the data I'm using the DESeq2 package. When I run the DESeqDataSetFromMatrix function I get this error:

Error in DESeqDataSet(se, design = design, ignoreRank) : some values in assay are negative

I went back to the non-normalized data and observed that there are a lot of negative values. I need suggestions on how to deal with this issue. Should I replace all such values with NA? I also observed that every probe has a detection p-value associated with it. Should I also filter out the expression values which have a p-value higher than 0.05?

> head(raw_counts)
    gbm-0001 Detection Pval gbm-0011 Detection Pval gbm-0013 Detection Pval gbm-0024 Detection Pval gbm-0058 Detection Pval
    ILMN_3167304   27.875      0.3071961 -24.9375      0.6689599    14.25       0.350027   4.3125      0.4723163     8.25      0.3966338
    ILMN_3168038  211.875       6.42E-05   1.0625      0.4925717    99.25    0.003645935  60.3125      0.1657064    72.25     0.01086376
    ILMN_3167890 13839.88       3.68E-38 23507.06       3.68E-38 22345.25       3.68E-38 20482.31       3.68E-38 21419.25       3.68E-38
    ILMN_3167526    5.875      0.4577177 105.0625     0.03279042     8.25      0.4117529 -29.6875       0.683706    11.25      0.3604082
    ILMN_3167209   82.875     0.06708142 182.0625      0.0007096    60.25     0.05167192   9.3125      0.4403947    72.25     0.01086376

Here's the R script which I used for all the mentioned steps.

library(GEOquery)
library(DESeq2)
library(tidyverse)


data <- getGEO(GEO = "GSE25631", 
           destdir = "E:\\GSE25631",
           GSEMatrix = TRUE,
           AnnotGPL = FALSE,
           getGPL = FALSE,
           parseCharacteristics = TRUE)

clindata <- data$GSE25631_series_matrix.txt.gz@phenoData@data

url = "https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSE25631&format=file&file=GSE25631%5Fnon%2Dnormalized%5Fdata%2Etxt%2Egz"
download.file(url, "E:\\GSE25631\\GSE25631_raw_counts.gz")
raw_counts <-read.delim("E:\\GSE25631\\GSE25631_raw_counts.gz", 
                 stringsAsFactors = FALSE, sep = "\t")
colnames(raw_counts) <- raw_counts[5,]
raw_counts <- raw_counts[-(1:5),]
raw_counts <- raw_counts %>% remove_rownames %>% column_to_rownames(var = "ID_REF")
raw_counts <- raw_counts[ , c(TRUE, FALSE)]


colnames(raw_counts) <- clindata$title
rownames(clindata) <- clindata$title
colnames(clindata)[colnames(clindata) == "disease state:ch1"] <- "Condition"
clindata$Condition[clindata$Condition == "primary glioblastoma multiform (brain tumor)"] <- "GBM"
clindata$Condition[clindata$Condition == "normal"] <- "Normal"
clindata$Condition <- as.factor(clindata$Condition)

dds <- DESeqDataSetFromMatrix(countData = raw_counts,
                      colData = clindata,
                      design = ~ Condition)

geo geoquery microarray normalization deseq2 • 1.1k views

ADD COMMENT • link updated 22 months ago by ATpoint 87k • written 22 months ago by accibio ▴ 20

score 4 · Accepted Answer · 2023-04-20

DESeq2 is meant for digital counts (RNA-seq and similar assays). Arrays are not supported, nor is the implemented method meaningful for array intensity values.

Guides on how to handle these data are probably in here: https://bioconductor.org/packages/release/data/experiment/vignettes/BeadArrayUseCases/inst/doc/BeadArrayUseCases.pdf