Limma Analysis Agilent Microarray Data (GPL1708)
1
0
Entering edit mode
8 months ago
hagl ▴ 20

Dear community,

I am trying to process Agilent Microarray dataset (Accession Number: GSE11223) in R. Basically I was using this approach: How to process (seems) Agilent microarrry data?

Unfortunately, there seems to be something wrong with the results and a couple of questions arose: 1) is it plausible, that when I collapse the duplicates to gene_symbols, there are only 12k remaining although there should be >20k annotated genes 2) if I got to right, each probe should be gIsWellAboveBG > 0 not only 3 but 4 times. If I apply the threshold 4, no samples remain. Is there a problem with the array data?

IsExpr <- rowSums(project_norm$other$gIsWellAboveBG > 0) >= 3

3) The results do not make any sense, either. E.g. Plotting the p values shows only highly significant values and no distribution as expected.

Here is my code. I used the files from the GEOmnibus GSE11223 (annotation, pheno and raw files)

  project <- read.maimages(      targets,
  source = 'agilent.median',
  green.only = TRUE,
  other.columns = 'gIsWellAboveBG')
colnames(project) <- gsub('raw\\/', '', colnames(project))

# (QC) ---- 

# edit feature (annotation) data 

annotLookup <- read.table(
  annotfile,
  header = FALSE,
  sep = '\t',
  stringsAsFactors = FALSE, 
  comment.char = "#", 
  fill = TRUE)

colnames(annotLookup) <- annotLookup[2,]
colnames(annotLookup)[1] <- 'AgilentID'
annotLookup <- annotLookup[c(-1,-2),]

annotLookup <- annotLookup[which(annotLookup$SPOT_ID %in% project$genes$ProbeName),]
annotLookup <- annotLookup[match(project$genes$ProbeName, annotLookup$SPOT_ID),]
table(project$genes$ProbeName == annotLookup$SPOT_ID) # check that annots are aligned
project$genes$AgilentID <- annotLookup$SPOT_ID
project$genes$DESCRIPTION <- annotLookup$DESCRIPTION
project$genes$ENSEMBL_ID <- annotLookup$ENSEMBL_ID
project$genes$ENTREZ_ID <- annotLookup$GENE
project$genes$gene_symbol <- annotLookup$GENE_SYMBOL
project$genes$gene_name <- annotLookup$GENE_NAME

# background correction

project_bgcorrect <- backgroundCorrect(project, method = 'normexp')

# normalize the data with the 'quantile' method

project_norm <- normalizeBetweenArrays(project_bgcorrect, method = 'quantile')

# filter out control probes, those with no symbol, and those that fail:

Control <- project_norm$genes$ControlType == 1 | project_norm$genes$ControlType == -1
NoSymbol <- is.na(project_norm$genes$external_gene_name)
sum(NoSymbol) # remove these when positive value 
IsExpr <- rowSums(project_norm$other$gIsWellAboveBG > 0) >= 3 # maximum is 4, but most of the values are only 3x positive
project_norm_filtered <- project_norm[!Control & IsExpr, ]

project_norm_filtered <- project_norm[IsExpr, ]
dim(project_norm)
dim(project_norm_filtered)

colnames(project_norm_filtered$genes)

# remove annotation columns we no longer need

project_norm_filtered$genes <- project_norm_filtered$genes[,c('AgilentID',
  'ProbeName','DESCRIPTION','ENSEMBL_ID','ENTREZ_ID','gene_symbol','gene_name')]
head(project_norm_filtered$genes)

# collaps duplicates (1. OPTION) ----

A <- rowMeans(project_norm_filtered$E)
o <- order(A, decreasing=TRUE)
y <- project_norm_filtered[o,]
d <- duplicated(y$genes$gene_symbol)
y <- y[!d,]

dim(y)

dim(project_norm_filtered)

# ALTERNATIVE: Averaging of probe duplicates (=2. OPTION)

raw_aver = avereps(project_norm_filtered,
            ID = project_norm_filtered$genes$AgilentID)

# filter for samples of interest & annotated genes only (ENSEMBL_ID, gene_symbol (add!))

y_filtered <- y[,y$targets$Group == "0"|y$targets$Group == "DISEASE IN REMISSION"]

# alternative: y_filtered <- raw_aver[,raw_aver$targets$Group == "0"|raw_aver$targets$Group == "DISEASE IN REMISSION"]

# model design matrix and log model... ---- 

design <- model.matrix(~0 + factor(y_filtered$targets$Group))

colnames(design) <- levels(factor(y_filtered$targets$Group))

fit <- lmFit(y_filtered, design)
efit<- eBayes(fit)

Session Info:

attached base packages:
[1] stats4    stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
 [1] RobustRankAggreg_1.2.1      hgu133plus2cdf_2.18.0       arrayQualityMetrics_3.58.0 
 [4] edgeR_4.0.16                dbplyr_2.5.0                GEOquery_2.70.0            
 [7] plotly_4.10.4               clusterProfiler_4.10.1      enrichplot_1.22.0          
[10] org.Hs.eg.db_3.18.0         AnnotationDbi_1.64.1        DExMA_1.10.7               
[13] DExMAdata_1.10.0            BiocManager_1.30.22         affy_1.80.0                
[16] biomaRt_2.58.2              gwasrapidd_0.99.17          openxlsx_4.2.5.2           
[19] RColorBrewer_1.1-3          pheatmap_1.0.12             EnhancedVolcano_1.20.0     
[22] ggrepel_0.9.5               limma_3.58.1                Glimma_2.12.0              
[25] readxl_1.4.3                lubridate_1.9.3             forcats_1.0.0              
[28] stringr_1.5.1               dplyr_1.1.4                 purrr_1.0.2                
[31] readr_2.1.5                 tidyr_1.3.1                 tibble_3.2.1               
[34] ggplot2_3.5.0               tidyverse_2.0.0             DESeq2_1.42.1              
[37] SummarizedExperiment_1.32.0 Biobase_2.62.0              MatrixGenerics_1.14.0      
[40] matrixStats_1.2.0           GenomicRanges_1.54.1        GenomeInfoDb_1.38.8        
[43] IRanges_2.36.0              S4Vectors_0.40.2            BiocGenerics_0.48.1  

loaded via a namespace (and not attached): ...

Microarray Limma Agilent • 800 views
ADD COMMENT
1
Entering edit mode
8 months ago
Gordon Smyth ★ 7.8k

There is an Agilent array case study in the limma User's Guide, which you might find helpful and somewhat simpler.

I don't recommend collapsing Agilent probes down to genes but, if you do, I don't recommend averaging. The first option is much better.

The reason why your p-values are too small is that you haven't formed any comparisons between the disease groups. See the User's Guide.

ADD COMMENT
0
Entering edit mode

Thank you very much for the response. Following your recommendation in pursuing option one by selecting highest expressions and subsequent collapsing to gene symbols.

only around 11k of gene symbols remain.

d <- duplicated(y$genes$gene_symbol)
> table(d)
d
FALSE  TRUE 
11960 29345 

summary(is.na(y$genes$gene_symbol))
   Mode   FALSE    TRUE 
logical   22025   19280 

I used this for my annotation: GPL1708_old_annotations.txt.gz and it should be a whole genome coverage respectively approx. 19k of gene symbols (when I collapse duplicated gene symbols in this file). So how come I end up with only 11k?

ADD REPLY
0
Entering edit mode

I actually advised you not to remove duplicate probes, especially as you seem to be use probe annotation that is very old and possibly out of date.

ADD REPLY
0
Entering edit mode

Dear Gordon, excuse me for not being specific regarding your advice. What would be the best way if I wanted to add new annotations to the probes - getting the information via biomart() with the attribute 'agilent_wholegenome_4x44k_v1'? My assumption that the annotation which can be downloaded directly with the GSE series was probably false.

ADD REPLY

Login before adding your answer.

Traffic: 3430 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6