DESeq2 throwing error while normalizing raw microarray expression data due to presence of negative values
1
0
Entering edit mode
19 months ago
accibio ▴ 20

I am trying to download and analyze a miRNA expression dataset from NCBI GEO (GSE25631). I specifically want non-normalized data to perform normalization and my own set of other analyses later on. Accordingly, I downloaded the Series Matrix File (to match the Sample IDs with their phenotype) and the non-normalized data GSE25631_non-normalized_data.txt.gz programmatically using the GEOquery package in R. To normalize the data I'm using the DESeq2 package. When I run the DESeqDataSetFromMatrix function I get this error:

Error in DESeqDataSet(se, design = design, ignoreRank) : some values in assay are negative

I went back to the non-normalized data and observed that there are a lot of negative values. I need suggestions on how to deal with this issue. Should I replace all such values with NA? I also observed that every probe has a detection p-value associated with it. Should I also filter out the expression values which have a p-value higher than 0.05?

> head(raw_counts)
    gbm-0001 Detection Pval gbm-0011 Detection Pval gbm-0013 Detection Pval gbm-0024 Detection Pval gbm-0058 Detection Pval
    ILMN_3167304   27.875      0.3071961 -24.9375      0.6689599    14.25       0.350027   4.3125      0.4723163     8.25      0.3966338
    ILMN_3168038  211.875       6.42E-05   1.0625      0.4925717    99.25    0.003645935  60.3125      0.1657064    72.25     0.01086376
    ILMN_3167890 13839.88       3.68E-38 23507.06       3.68E-38 22345.25       3.68E-38 20482.31       3.68E-38 21419.25       3.68E-38
    ILMN_3167526    5.875      0.4577177 105.0625     0.03279042     8.25      0.4117529 -29.6875       0.683706    11.25      0.3604082
    ILMN_3167209   82.875     0.06708142 182.0625      0.0007096    60.25     0.05167192   9.3125      0.4403947    72.25     0.01086376

Here's the R script which I used for all the mentioned steps.

library(GEOquery)
library(DESeq2)
library(tidyverse)


data <- getGEO(GEO = "GSE25631", 
           destdir = "E:\\GSE25631",
           GSEMatrix = TRUE,
           AnnotGPL = FALSE,
           getGPL = FALSE,
           parseCharacteristics = TRUE)

clindata <- data$GSE25631_series_matrix.txt.gz@phenoData@data

url = "https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSE25631&format=file&file=GSE25631%5Fnon%2Dnormalized%5Fdata%2Etxt%2Egz"
download.file(url, "E:\\GSE25631\\GSE25631_raw_counts.gz")
raw_counts <-read.delim("E:\\GSE25631\\GSE25631_raw_counts.gz", 
                 stringsAsFactors = FALSE, sep = "\t")
colnames(raw_counts) <- raw_counts[5,]
raw_counts <- raw_counts[-(1:5),]
raw_counts <- raw_counts %>% remove_rownames %>% column_to_rownames(var = "ID_REF")
raw_counts <- raw_counts[ , c(TRUE, FALSE)]


colnames(raw_counts) <- clindata$title
rownames(clindata) <- clindata$title
colnames(clindata)[colnames(clindata) == "disease state:ch1"] <- "Condition"
clindata$Condition[clindata$Condition == "primary glioblastoma multiform (brain tumor)"] <- "GBM"
clindata$Condition[clindata$Condition == "normal"] <- "Normal"
clindata$Condition <- as.factor(clindata$Condition)

dds <- DESeqDataSetFromMatrix(countData = raw_counts,
                      colData = clindata,
                      design = ~ Condition)
geo geoquery microarray normalization deseq2 • 984 views
ADD COMMENT
4
Entering edit mode
19 months ago
ATpoint 85k

DESeq2 is meant for digital counts (RNA-seq and similar assays). Arrays are not supported, nor is the implemented method meaningful for array intensity values.

Guides on how to handle these data are probably in here: https://bioconductor.org/packages/release/data/experiment/vignettes/BeadArrayUseCases/inst/doc/BeadArrayUseCases.pdf

ADD COMMENT

Login before adding your answer.

Traffic: 2718 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6