I am trying to download and analyze a miRNA expression dataset from NCBI GEO (GSE25631)
. I specifically want non-normalized data to perform normalization and my own set of other analyses later on. Accordingly, I downloaded the Series Matrix File
(to match the Sample IDs with their phenotype) and the non-normalized data GSE25631_non-normalized_data.txt.gz
programmatically using the GEOquery
package in R. To normalize the data I'm using the DESeq2
package. When I run the DESeqDataSetFromMatrix
function I get this error:
Error in DESeqDataSet(se, design = design, ignoreRank) : some values in assay are negative
I went back to the non-normalized data and observed that there are a lot of negative values. I need suggestions on how to deal with this issue. Should I replace all such values with NA? I also observed that every probe has a detection p-value associated with it. Should I also filter out the expression values which have a p-value higher than 0.05?
> head(raw_counts)
gbm-0001 Detection Pval gbm-0011 Detection Pval gbm-0013 Detection Pval gbm-0024 Detection Pval gbm-0058 Detection Pval
ILMN_3167304 27.875 0.3071961 -24.9375 0.6689599 14.25 0.350027 4.3125 0.4723163 8.25 0.3966338
ILMN_3168038 211.875 6.42E-05 1.0625 0.4925717 99.25 0.003645935 60.3125 0.1657064 72.25 0.01086376
ILMN_3167890 13839.88 3.68E-38 23507.06 3.68E-38 22345.25 3.68E-38 20482.31 3.68E-38 21419.25 3.68E-38
ILMN_3167526 5.875 0.4577177 105.0625 0.03279042 8.25 0.4117529 -29.6875 0.683706 11.25 0.3604082
ILMN_3167209 82.875 0.06708142 182.0625 0.0007096 60.25 0.05167192 9.3125 0.4403947 72.25 0.01086376
Here's the R script which I used for all the mentioned steps.
library(GEOquery)
library(DESeq2)
library(tidyverse)
data <- getGEO(GEO = "GSE25631",
destdir = "E:\\GSE25631",
GSEMatrix = TRUE,
AnnotGPL = FALSE,
getGPL = FALSE,
parseCharacteristics = TRUE)
clindata <- data$GSE25631_series_matrix.txt.gz@phenoData@data
url = "https://www.ncbi.nlm.nih.gov/geo/download/?acc=GSE25631&format=file&file=GSE25631%5Fnon%2Dnormalized%5Fdata%2Etxt%2Egz"
download.file(url, "E:\\GSE25631\\GSE25631_raw_counts.gz")
raw_counts <-read.delim("E:\\GSE25631\\GSE25631_raw_counts.gz",
stringsAsFactors = FALSE, sep = "\t")
colnames(raw_counts) <- raw_counts[5,]
raw_counts <- raw_counts[-(1:5),]
raw_counts <- raw_counts %>% remove_rownames %>% column_to_rownames(var = "ID_REF")
raw_counts <- raw_counts[ , c(TRUE, FALSE)]
colnames(raw_counts) <- clindata$title
rownames(clindata) <- clindata$title
colnames(clindata)[colnames(clindata) == "disease state:ch1"] <- "Condition"
clindata$Condition[clindata$Condition == "primary glioblastoma multiform (brain tumor)"] <- "GBM"
clindata$Condition[clindata$Condition == "normal"] <- "Normal"
clindata$Condition <- as.factor(clindata$Condition)
dds <- DESeqDataSetFromMatrix(countData = raw_counts,
colData = clindata,
design = ~ Condition)