Question

How to collapse multiple expression values for the same gene using information from GEO series matrix files?

5.0 years ago

pablojosegiraudi • 0

Dear all,

Not being a bioinformatician, I have minimal knowledge of using R to analyse microarray datasets. I was able to annotate the series matrix from GEO (GSE59045) using GEOquery, but I noticed that there are multiple expression values for the same gene. Can somebody help me with an R script to collapse the information so that I obtain only one expression value for each gene? From other posts I learned that the best option is to select the highest value (maximum) for each gene. I will lose information about isoforms, but I do not need it at the moment.

Example

               GSM1424930   GSM1424931   GSM1424932   GSM1424933   Gene Symbol  Ensembl
11715100_at    3.681680528  3.615040247  3.681680528  3.725140832  HIST1H3G     ENSG00000273983
11715101_s_at  5.804431414  5.982370634  5.982370634  6.219341531  HIST1H3G     ENSG00000273983
11715102_x_at  3.779383579  3.760277943  3.608772565  3.816631661  HIST1H3G     ENSG00000273983
11715103_x_at  7.194430009  8.058842933  7.606162382  7.5365415    TNFAIP8L1    ENSG00000185361
1715104_s_at   5.286305369  5.338718503  5.499863475  5.616707797  OTOP2        ENSG00000183034

Thanks.

R gene microarrays GEO datasets • 1.8k views

updated 5.0 years ago by russhh 5.7k • written 5.0 years ago by pablojosegiraudi • 0

score 0 · Answer 1 · 2020-01-07

You can use limma::avereps (https://www.rdocumentation.org/packages/limma/versions/3.28.14/topics/avereps). To use this, you define a set of groupings over which to summarise (here, each group would contain the probes for a single Ensembl ID). However, avereps returns a matrix rather than an ExpressionSet, so its use would lead to you losing the phenotype / treatment data that you need in statistical modelling.
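To make the grouping idea concrete, here is a minimal base-R sketch of averaging probes that share an ID. It is not limma's code, only an illustration of the same per-group mean; the expression values are rounded from the question and the object names (`expr`, `ensembl`, `averaged`) are made up for this example.

```r
# Toy log-intensity matrix: 5 probes x 2 samples (values rounded from the question)
expr <- matrix(
  c(3.68, 5.80, 3.78, 7.19, 5.29,
    3.62, 5.98, 3.76, 8.06, 5.34),
  ncol = 2,
  dimnames = list(
    c("11715100_at", "11715101_s_at", "11715102_x_at",
      "11715103_x_at", "1715104_s_at"),
    c("GSM1424930", "GSM1424931")
  )
)

# One grouping ID per probe (three probes map to the same Ensembl ID)
ensembl <- c(rep("ENSG00000273983", 3), "ENSG00000185361", "ENSG00000183034")

# Sum the rows within each ID, then divide by the number of probes per ID.
# reorder = FALSE keeps the IDs in order of first appearance.
sums <- rowsum(expr, group = ensembl, reorder = FALSE)
counts <- rowsum(rep(1, nrow(expr)), group = ensembl, reorder = FALSE)
averaged <- sums / as.vector(counts)

print(averaged)
```

The result has one row per Ensembl ID, which is the shape of matrix you would expect back from a per-ID averaging step.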

Here's my solution:

collapse_by_feature_ids <- function(eset, id = "Ensembl") {
  # Drop probes that have no usable ID in the chosen feature-data column
  .filter <- function() {
    id_set <- Biobase::fData(eset)[[id]]
    keep_rows <- which(!is.na(id_set) & id_set != "")
    eset[keep_rows, ]
  }
  # Average the probes that share an ID and rebuild an ExpressionSet
  .collapse <- function(eset) {
    fdata <- Biobase::fData(eset)
    gene_groups <- fdata[[id]]
    averaged_matrix <- limma::avereps(Biobase::exprs(eset), ID = gene_groups)
    Biobase::ExpressionSet(
      averaged_matrix,
      phenoData = Biobase::phenoData(eset)
    )
  }
  stopifnot(methods::is(eset, "ExpressionSet"))
  stopifnot(id %in% colnames(Biobase::fData(eset)))
  filtered_eset <- .filter()
  .collapse(filtered_eset)
}

This ensures that the phenotype / treatment data are retained and that an ExpressionSet is returned. It takes the arithmetic mean of the log expression intensities for each Ensembl ID. The returned object has no feature data, so you should add a feature-data data frame afterwards.
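One way to build that feature-data data frame is sketched below in base R, assuming you keep the first annotation row for each ID and order the rows to match the collapsed object. The data frame, its `Symbol` column and the `collapsed_ids` vector are hypothetical stand-ins; with real data, `collapsed_ids` would come from the row names of the collapsed ExpressionSet.

```r
# Hypothetical probe-level annotation, one row per probe
feature_df <- data.frame(
  Symbol = c("HIST1H3G", "HIST1H3G", "HIST1H3G", "TNFAIP8L1", "OTOP2"),
  Ensembl = c(rep("ENSG00000273983", 3), "ENSG00000185361", "ENSG00000183034"),
  stringsAsFactors = FALSE
)

# Stand-in for the row names of the collapsed ExpressionSet (assumption)
collapsed_ids <- c("ENSG00000273983", "ENSG00000185361", "ENSG00000183034")

# Keep the first annotation row for each Ensembl ID
gene_level <- feature_df[!duplicated(feature_df$Ensembl), , drop = FALSE]

# Put the rows in the same order as the collapsed object and name them by ID
gene_level <- gene_level[match(collapsed_ids, gene_level$Ensembl), , drop = FALSE]
rownames(gene_level) <- gene_level$Ensembl

print(gene_level)

# With a real collapsed ExpressionSet (here called `collapsed`, an assumed name)
# the table could then be attached with:
# Biobase::fData(collapsed) <- gene_level
```

The row names need to match the feature names of the collapsed object, which is why the rows are reordered with `match()` before being named.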