Dear community,
I am trying to process Agilent Microarray dataset (Accession Number: GSE11223) in R. Basically I was using this approach: How to process (seems) Agilent microarrry data?
Unfortunately, there seems to be something wrong with the results and a couple of questions arose: 1) is it plausible, that when I collapse the duplicates to gene_symbols, there are only 12k remaining although there should be >20k annotated genes 2) if I got to right, each probe should be gIsWellAboveBG > 0 not only 3 but 4 times. If I apply the threshold 4, no samples remain. Is there a problem with the array data?
IsExpr <- rowSums(project_norm$other$gIsWellAboveBG > 0) >= 3
3) The results do not make any sense, either. E.g. Plotting the p values shows only highly significant values and no distribution as expected.
Here is my code. I used the files from the GEOmnibus GSE11223 (annotation, pheno and raw files)
project <- read.maimages( targets,
source = 'agilent.median',
green.only = TRUE,
other.columns = 'gIsWellAboveBG')
colnames(project) <- gsub('raw\\/', '', colnames(project))
# (QC) ----
# edit feature (annotation) data
annotLookup <- read.table(
annotfile,
header = FALSE,
sep = '\t',
stringsAsFactors = FALSE,
comment.char = "#",
fill = TRUE)
colnames(annotLookup) <- annotLookup[2,]
colnames(annotLookup)[1] <- 'AgilentID'
annotLookup <- annotLookup[c(-1,-2),]
annotLookup <- annotLookup[which(annotLookup$SPOT_ID %in% project$genes$ProbeName),]
annotLookup <- annotLookup[match(project$genes$ProbeName, annotLookup$SPOT_ID),]
table(project$genes$ProbeName == annotLookup$SPOT_ID) # check that annots are aligned
project$genes$AgilentID <- annotLookup$SPOT_ID
project$genes$DESCRIPTION <- annotLookup$DESCRIPTION
project$genes$ENSEMBL_ID <- annotLookup$ENSEMBL_ID
project$genes$ENTREZ_ID <- annotLookup$GENE
project$genes$gene_symbol <- annotLookup$GENE_SYMBOL
project$genes$gene_name <- annotLookup$GENE_NAME
# background correction
project_bgcorrect <- backgroundCorrect(project, method = 'normexp')
# normalize the data with the 'quantile' method
project_norm <- normalizeBetweenArrays(project_bgcorrect, method = 'quantile')
# filter out control probes, those with no symbol, and those that fail:
Control <- project_norm$genes$ControlType == 1 | project_norm$genes$ControlType == -1
NoSymbol <- is.na(project_norm$genes$external_gene_name)
sum(NoSymbol) # remove these when positive value
IsExpr <- rowSums(project_norm$other$gIsWellAboveBG > 0) >= 3 # maximum is 4, but most of the values are only 3x positive
project_norm_filtered <- project_norm[!Control & IsExpr, ]
project_norm_filtered <- project_norm[IsExpr, ]
dim(project_norm)
dim(project_norm_filtered)
colnames(project_norm_filtered$genes)
# remove annotation columns we no longer need
project_norm_filtered$genes <- project_norm_filtered$genes[,c('AgilentID',
'ProbeName','DESCRIPTION','ENSEMBL_ID','ENTREZ_ID','gene_symbol','gene_name')]
head(project_norm_filtered$genes)
# collaps duplicates (1. OPTION) ----
A <- rowMeans(project_norm_filtered$E)
o <- order(A, decreasing=TRUE)
y <- project_norm_filtered[o,]
d <- duplicated(y$genes$gene_symbol)
y <- y[!d,]
dim(y)
dim(project_norm_filtered)
# ALTERNATIVE: Averaging of probe duplicates (=2. OPTION)
raw_aver = avereps(project_norm_filtered,
ID = project_norm_filtered$genes$AgilentID)
# filter for samples of interest & annotated genes only (ENSEMBL_ID, gene_symbol (add!))
y_filtered <- y[,y$targets$Group == "0"|y$targets$Group == "DISEASE IN REMISSION"]
# alternative: y_filtered <- raw_aver[,raw_aver$targets$Group == "0"|raw_aver$targets$Group == "DISEASE IN REMISSION"]
# model design matrix and log model... ----
design <- model.matrix(~0 + factor(y_filtered$targets$Group))
colnames(design) <- levels(factor(y_filtered$targets$Group))
fit <- lmFit(y_filtered, design)
efit<- eBayes(fit)
Session Info:
attached base packages:
[1] stats4 stats graphics grDevices datasets utils methods base
other attached packages:
[1] RobustRankAggreg_1.2.1 hgu133plus2cdf_2.18.0 arrayQualityMetrics_3.58.0
[4] edgeR_4.0.16 dbplyr_2.5.0 GEOquery_2.70.0
[7] plotly_4.10.4 clusterProfiler_4.10.1 enrichplot_1.22.0
[10] org.Hs.eg.db_3.18.0 AnnotationDbi_1.64.1 DExMA_1.10.7
[13] DExMAdata_1.10.0 BiocManager_1.30.22 affy_1.80.0
[16] biomaRt_2.58.2 gwasrapidd_0.99.17 openxlsx_4.2.5.2
[19] RColorBrewer_1.1-3 pheatmap_1.0.12 EnhancedVolcano_1.20.0
[22] ggrepel_0.9.5 limma_3.58.1 Glimma_2.12.0
[25] readxl_1.4.3 lubridate_1.9.3 forcats_1.0.0
[28] stringr_1.5.1 dplyr_1.1.4 purrr_1.0.2
[31] readr_2.1.5 tidyr_1.3.1 tibble_3.2.1
[34] ggplot2_3.5.0 tidyverse_2.0.0 DESeq2_1.42.1
[37] SummarizedExperiment_1.32.0 Biobase_2.62.0 MatrixGenerics_1.14.0
[40] matrixStats_1.2.0 GenomicRanges_1.54.1 GenomeInfoDb_1.38.8
[43] IRanges_2.36.0 S4Vectors_0.40.2 BiocGenerics_0.48.1
loaded via a namespace (and not attached): ...
Thank you very much for the response. Following your recommendation in pursuing option one by selecting highest expressions and subsequent collapsing to gene symbols.
only around 11k of gene symbols remain.
I used this for my annotation: GPL1708_old_annotations.txt.gz and it should be a whole genome coverage respectively approx. 19k of gene symbols (when I collapse duplicated gene symbols in this file). So how come I end up with only 11k?
I actually advised you not to remove duplicate probes, especially as you seem to be use probe annotation that is very old and possibly out of date.
Dear Gordon, excuse me for not being specific regarding your advice. What would be the best way if I wanted to add new annotations to the probes - getting the information via biomart() with the attribute 'agilent_wholegenome_4x44k_v1'? My assumption that the annotation which can be downloaded directly with the GSE series was probably false.