How to collapse multiple expression values for a same gene using information from GEO series matrix files?
1
0
Entering edit mode
5.0 years ago

Dear all,

Not been a bioinformatician I have minimal knowledge of using R to analyse microarray datasets. I was able to annotate the series matrix from GEO (GSE59045) using GEOquery, but I noted that there are multiple expression values for the same gene. Can somebody help me with a script for R to collapse the information, obtaining only one expression value to each gene? From other posts I learn that the best option is to select the highest value (maximum) for each gene, I will loss information regarding isoforms but it is not necessary for me at the moment.

Example

                       GSM1424930   GSM1424931  GSM1424932  GSM1424933         Gene Symbol  Ensembl

11715100_at 3.681680528 3.615040247 3.681680528 3.725140832 HIST1H3G ENSG00000273983 11715101_s_at 5.804431414 5.982370634 5.982370634 6.219341531 HIST1H3G ENSG00000273983
11715102_x_at 3.779383579 3.760277943 3.608772565 3.816631661 HIST1H3G ENSG00000273983 11715103_x_at 7.194430009 8.058842933 7.606162382 7.5365415 TNFAIP8L1 ENSG00000185361 1715104_s_at 5.286305369 5.338718503 5.499863475 5.616707797 OTOP2 ENSG00000183034

Thanks.

R gene microarrays GEO datasets • 1.8k views
ADD COMMENT
0
Entering edit mode
5.0 years ago
russhh 5.7k

You can use limma::avereps (https://www.rdocumentation.org/packages/limma/versions/3.28.14/topics/avereps). To use this, you define a set of groupings over which to summarise (here each group would be contain a single Ensembl ID). However, avereps returns a matrix, rather than an ExpressionSet, so it's use would lead to you losing the phenotype / treatment data that you need in statistical modelling:

Here's my solution:

collapse_by_feature_ids <- function(eset, id = "Ensembl") {
  .filter <- function() {
    id_set <- Biobase::fData(eset)[[id]]
    keep_rows <- which(!is.na(id_set) & id_set != "")
    eset[keep_rows, ]
  }
  .collapse <- function(eset) {
    fdata <- Biobase::fdata(eset)
    gene_groups <- fdata[[id]]
    averaged_matrix <- limma::avereps(eset, ID = gene_groups)
    Biobase::ExpressionSet(                                                                        
      averaged_matrix,                                                                             
      phenoData = Biobase::phenoData(eset)                                                                  
    )
  }
  stopifnot(is(eset, "ExpressionSet"))
  stopifnot(id %in% colnames(Biobase::fData(eset)))
  filtered_eset <- .filter()
  .collapse(filtered_eset)                                                                                   
}

This ensures that the phenotype / treatment data is retained and that an expressionSet is returned. It takes the arithmetic mean of the expression-log-intensities for each Ensembl ID. You should add a feature data data-frame in afterwards

ADD COMMENT
0
Entering edit mode

Also, you might want to ensure there are no empty-string or NA values in the Ensembl column

ADD REPLY
0
Entering edit mode

many thanks for your answer, no problem I will filter the matrix for NA values in Ensembl column. Can help me with an example of the procedure or suggesting me a tutorial to follow in R?

ADD REPLY
0
Entering edit mode

The featureData entry of an ExpressionSet is a (type of) data.frame, so you can find NAs and empty strings as you would in a normal data.frame. I've added some code to do it into the above function.

ADD REPLY

Login before adding your answer.

Traffic: 3760 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6