Question

How to assess quality of publicly available microarray data

0

Entering edit mode

16 months ago

Will • 0

Hello everyone,

I am relatively new to microarray technology and gene expression analysis. I am using publicly available microarray data and I want to know if the data is quality enough for a larger analysis I am doing.

I read on the expression console user guide that the hybridization control probes ought to follow a pattern of increasing expression depending on the probeset. For example, the AFFX-r2-Ec-BioB probe should be expressed less than the AFFX-r2-Ec-BioC if the correct protocol is followed. However, when processing data for different Affymetrix HGU133 plus 2 arrays in RStudio I found that this pattern did not hold for one of the arrays but did for another. I included boxplots of the RMA normalized expression for two arrays below. (one was in base R and the other ggplot, my apologies)

boxplot of a single microarray hybridization probes from study A

How should I interpret this information as it relates to what's happening on the microarray? Is there a standard method of checking the quality of a microarray from publicly available CEL files? Any insight is appreciated.

control probes affymetrix • 626 views

ADD COMMENT • link 16 months ago by Will • 0

0

Entering edit mode

Hi Will, please share the code that you used to access the data, and also the code to generate the plots. Note that I can infer that the study number is GSE16028; moreover, I can immediately see that the authors provided data 'normalised' via MAS5, not RMA.

ADD REPLY • link 16 months ago by Kevin Blighe 88k

0

Entering edit mode

Sure. I downloaded the CEL files from GEO, in the case of the first plot its GSE16028. Of note, the authors claim to have processed two batches of microarrays, so I subsetted them locally. However, the boxplots appear similarly across all microarrays for this particular study.

setwd("/Users/willbaltazar/Desktop/work_repo")
celpath <- paste0(getwd(),"/data/GSE16028_RAW/batch1")
class(celpath)
data <- ReadAffy(celfile.path = celpath)

# Normalize the data into an eSet
data <- rma(data)

# Get expression estimates
normalized_expression <- exprs(data)

I used the getGEO function to pull the pheno data. Here's the code I used to look at these probes:

> index <- grep(pattern = "cre|bioC|bioD|bioB", x = row.names(normalized_expression))
> head(index)
[1] 54661 54662 54663 54664 54665 54666
> normalized_expression[index,1:5]
                     GSM401059 GSM401060 GSM401061 GSM401064 GSM401065
AFFX-r2-Ec-bioB-3_at  3.207902  3.490101  3.135174  3.263243  3.256381
AFFX-r2-Ec-bioB-5_at  2.451986  2.329745  2.637099  2.563886  2.500182
AFFX-r2-Ec-bioB-M_at  3.008145  2.866189  2.840487  2.962931  2.812796
AFFX-r2-Ec-bioC-3_at  2.734562  2.871741  3.114969  3.227851  3.206967
AFFX-r2-Ec-bioC-5_at  2.599714  2.735176  2.528859  2.336966  2.597684
AFFX-r2-Ec-bioD-3_at  1.977988  2.055390  2.027907  2.000599  2.323826
AFFX-r2-Ec-bioD-5_at  2.784274  2.517126  2.489869  2.505398  2.567285
AFFX-r2-P1-cre-3_at   2.566813  2.436951  2.209591  2.132996  2.330542
AFFX-r2-P1-cre-5_at   2.376335  2.209607  2.235569  1.998427  2.075194
> # Plot hybridization probe expression
> hybrids <- as_tibble(normalized_expression[index,], rownames = "probe")
> for (i in 1:27) {
+   if (str_detect(hybrids[i,1], "cre")) {
+     hybrids[i,1] <- "cre"
+   } else if (str_detect(hybrids[i,1], "bioC")) {
+     hybrids[i,1] <- "bioC"
+   } else if (str_detect(hybrids[i,1], "bioD")) {
+     hybrids[i,1] <- "bioD"
+   } else {
+     hybrids[i,1] <- "bioB"
+   }
+ }
Error in if (str_detect(hybrids[i, 1], "cre")) { : 
  missing value where TRUE/FALSE needed

The error should have no impact on the structure needed for testing each microarray, which I show by previewing the table

> hybrids
# A tibble: 9 × 61
  probe GSM401059 GSM401060 GSM401061 GSM401064 GSM401065 GSM401066 GSM401069 GSM401070 GSM401071 GSM401074
  <chr>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>
1 bioB       3.21      3.49      3.14      3.26      3.26      3.31      3.43      3.13      3.29      3.18
2 bioB       2.45      2.33      2.64      2.56      2.50      2.70      2.71      2.68      2.62      2.62
3 bioB       3.01      2.87      2.84      2.96      2.81      2.86      2.98      2.94      2.81      2.99
4 bioC       2.73      2.87      3.11      3.23      3.21      3.29      3.14      3.05      3.16      3.16
5 bioC       2.60      2.74      2.53      2.34      2.60      2.44      2.63      2.51      2.47      2.43
6 bioD       1.98      2.06      2.03      2.00      2.32      2.13      2.13      2.08      2.12      2.31
7 bioD       2.78      2.52      2.49      2.51      2.57      2.44      2.33      2.66      2.49      2.56
8 cre        2.57      2.44      2.21      2.13      2.33      2.32      2.26      2.10      2.26      2.17
9 cre        2.38      2.21      2.24      2.00      2.08      2.41      2.44      2.26      2.14      2.15
#  50 more variables: GSM401075 <dbl>, GSM401076 <dbl>, GSM401079 <dbl>, GSM401080 <dbl>, GSM401081 <dbl>,
#   GSM401089 <dbl>, GSM401090 <dbl>, GSM401091 <dbl>, GSM401094 <dbl>, GSM401095 <dbl>, GSM401096 <dbl>,
#   GSM401099 <dbl>, GSM401100 <dbl>, GSM401101 <dbl>, GSM401104 <dbl>, GSM401105 <dbl>, GSM401106 <dbl>,
#   GSM401109 <dbl>, GSM401110 <dbl>, GSM401111 <dbl>, GSM401114 <dbl>, GSM401115 <dbl>, GSM401116 <dbl>,
#   GSM401119 <dbl>, GSM401120 <dbl>, GSM401121 <dbl>, GSM401124 <dbl>, GSM401125 <dbl>, GSM401126 <dbl>,
#   GSM401129 <dbl>, GSM401130 <dbl>, GSM401131 <dbl>, GSM401134 <dbl>, GSM401135 <dbl>, GSM401136 <dbl>,
#   GSM401139 <dbl>, GSM401140 <dbl>, GSM401141 <dbl>, GSM401144 <dbl>, GSM401145 <dbl>, GSM401146 <dbl>, …

Next, I ran a test to see if the probes were in the expected order, then stored arrays which didn't meet the criteria into a list which I can graph at will.

> hybrids$probe <- factor(hybrids$probe, levels = c("bioB", "bioC", "bioD", "cre"))
> hybridsresults <- as.data.frame(colnames(hybrids)[-1])
> hybridsresults$test <- rep(NA)
> graphs <- list()
> for (i in 1:100) {
+   hybrids_test <- hybrids %>% 
+     select(probe, expr = colnames(hybrids)[i+1]) %>% 
+     group_by(probe) %>% 
+     summarise(mean = mean(expr)) %>% 
+     arrange(mean)
+   if (identical(as.vector(hybrids_test$probe), c("bioB", "bioC", "bioD", "cre"))) {
+     hybridsresults$test[i] <- identical(as.vector(hybrids_test$probe), c("bioB", "bioC", "bioD", "cre"))
+   } else {
+     hybridsresults$test[i] <- identical(as.vector(hybrids_test$probe), c("bioB", "bioC", "bioD", "cre"))
+     graphs[i] <- hybrids %>% 
+       select(probe, expr = colnames(hybrids)[i+1]) %>% 
+       ggplot(aes(x = probe, y = expr, color = probe)) +
+       geom_boxplot()
+   }
+ }
Error in `select()`:
! Selections can't have missing values.
Run `rlang::last_trace()` to see where the error occurred.
There were 50 or more warnings (use warnings() to see the first 50)
> falses <- hybridsresults[hybridsresults$test == FALSE,]
> head(pheno_data[pheno_data$geo_accession %in% falses$`colnames(hybrids)[-1]`,])
                                            title geo_accession Patient_ID Age  Sex    Day
GSM401059  Blood sample from patient 106 at Day 1     GSM401059        106  40 Male  Day 1
GSM401060 Blood sample from patient 106 at Day 14     GSM401060        106  40 Male Day 14
GSM401061 Blood sample from patient 106 at Day 28     GSM401061        106  40 Male Day 28
GSM401064  Blood sample from patient 108 at Day 1     GSM401064        108  24 Male  Day 1
GSM401065 Blood sample from patient 108 at Day 14     GSM401065        108  24 Male Day 14
GSM401066 Blood sample from patient 108 at Day 28     GSM401066        108  24 Male Day 28
> plot(graphs[[1]])
> title(falses$`colnames(hybrids)[-1]`[1])

The result of the last two lines produces the graph: enter image description here

I used the data from GSM401059 and ran it through MAS5 to get the following:

> object <- ReadAffy(celfile.path = "/Users/willbaltazar/Desktop/GSM401059")
> values <- mas5(object)
background correction: mas 
PM/MM correction : mas 
expression values: mas 
background correcting...
done.
54675 ids to be processed
|                    |
|####################|
Warning messages:
1: replacing previous import 'AnnotationDbi::tail' by 'utils::tail' when loading 'hgu133plus2cdf' 
2: replacing previous import 'AnnotationDbi::head' by 'utils::head' when loading 'hgu133plus2cdf' 
> normalized_expression <- exprs(values)
> dim(normalized_expression)
[1] 54675     1
> # Check hybridization probe expression
> index <- grep(pattern = "cre|bioC|bioD|bioB", x = row.names(normalized_expression))
> head(index)
[1] 54661 54662 54663 54664 54665 54666
> normalized_expression[index,]
AFFX-r2-Ec-bioB-3_at AFFX-r2-Ec-bioB-5_at AFFX-r2-Ec-bioB-M_at 
            77.00275             11.60808             69.66242 
AFFX-r2-Ec-bioC-3_at AFFX-r2-Ec-bioC-5_at AFFX-r2-Ec-bioD-3_at 
            48.60067             46.31939            131.47989 
AFFX-r2-Ec-bioD-5_at  AFFX-r2-P1-cre-3_at  AFFX-r2-P1-cre-5_at 
            69.51501             82.69861             24.78448 
> # Plot hybridization probe expression
> hybrids <- as_tibble(normalized_expression[index,], rownames = "probe")
> for (i in 1:27) {
+   if (str_detect(hybrids[i,1], "cre")) {
+     hybrids[i,1] <- "cre"
+   } else if (str_detect(hybrids[i,1], "bioC")) {
+     hybrids[i,1] <- "bioC"
+   } else if (str_detect(hybrids[i,1], "bioD")) {
+     hybrids[i,1] <- "bioD"
+   } else {
+     hybrids[i,1] <- "bioB"
+   }
+ }
Error in if (str_detect(hybrids[i, 1], "cre")) { : 
  missing value where TRUE/FALSE needed
> hybrids$probe <- factor(hybrids$probe, levels = c("bioB","bioC","bioD","cre"))
> ggplot(hybrids, aes(x=probe,y = value, color = probe)) +
+   geom_boxplot()

enter image description here

ADD REPLY • link 16 months ago by Will • 0