Bioconductor, how to select a subset of samples in an ExpressionSet?
2
0
Entering edit mode
6.0 years ago
Davide Chicco ▴ 120

I'm working on an R script that downloads gene expression data from GEO, through Bioconductor and the getGEO() function.

These commands download all the 436 samples of the repository, but I'm only interested in 157 of them. Precisely, I'm interested in handling only the "samples collection:ch1" column with values ""on the 1st day of MI (admission)" or "N/A". How can I select this subset?

I tried to download the complete dataset into the gset variable, and then to apply the following command:

gset_reduced@phenoData@data <- gset@phenoData@data[gset@phenoData@data$"samples collection:ch1"=="on the 1st day of MI (admission)" | gset@phenoData@data$"samples collection:ch1"=="N/A", ]

but this way only the phenoData feature of the gset_reducted variable correctly contained the 157 samples. The other features of the gset_reducted variable, instead, still contain 436 samples.

How can I select my subset of samples right from the beginning?

Here's my working R code:

setwd(".")
options(stringsAsFactors = FALSE)

source("https://bioconductor.org/biocLite.R")

listOfBiocPackages <- c("oligo", "GEOquery", "affyio", "biomaRt", "sva", "pamr", "limma", "BiocParallel", "genefilter", "GO.db")

bioCpackagesNotInstalled <- which( !listOfBiocPackages %in% rownames(installed.packages()) )
cat("package missing listOfBiocPackages[", bioCpackagesNotInstalled, "]: ", listOfBiocPackages[bioCpackagesNotInstalled], "\n", sep="")

# check there's still something left to install
if( length(bioCpackagesNotInstalled) ) {
    biocLite(listOfBiocPackages[bioCpackagesNotInstalled])
}


library("oligo")
library("GEOquery")
library("affyio")
library("biomaRt")
library("sva")
library("pamr")
library("limma")
library("BiocParallel")

GSE_code <- "GSE59867"
getGEOSuppFiles(GSE_code) 
gset <- getGEO(GSE_code, GSEMatrix =TRUE, getGPL=FALSE)

if (length(gset) > 1) idx <- grep("GPL570", attr(gset, "names")) else idx <- 1
gset <- gset[[idx]]

gset_reduced <- gset

gset_reduced@phenoData@data <- gset@phenoData@data[gset@phenoData@data$"samples collection:ch1"=="on the 1st day of MI (admission)" | gset@phenoData@data$"samples collection:ch1"=="N/A", ]

Can anyone help me?

Thanks!

bioconductor expressionset subset • 8.7k views
ADD COMMENT
4
Entering edit mode
6.0 years ago

The way that ExpressionSet objects work, you can just filter on the main object, and the changes will then carry through to all sub-components of the object:

filter <- colnames(gset)[gset@phenoData@data$"samples collection:ch1"=="on the 1st day of MI (admission)" | gset@phenoData@data$"samples collection:ch1"=="N/A"]

length(filter)
[1] 157

gset.filt <- gset[,filter]

gset.filt

ExpressionSet (storageMode: lockedEnvironment)
assayData: 33297 features, 157 samples 
  element names: exprs 
protocolData: none
phenoData
  sampleNames: GSM1448335 GSM1448338 ... GSM1620804 (157 total)
  varLabels: title geo_accession ... samples collection:ch1 (34 total)
  varMetadata: labelDescription
featureData: none
experimentData: use 'experimentData(object)'
Annotation: GPL6244

dim(exprs(gset.filt))
[1] 33297   157

dim(pData(gset.filt))
[1] 157  34

nrow(gset.filt@phenoData@data)
[1] 157

Kevin

ADD COMMENT
0
Entering edit mode

I did try to filter out the probes as suggested here, but the output turns out to be empty. I think it has to do with the different column names (GSMIDs) in eset and phenodata (Gender, Case/Control etc). Do I have to transpose one of them so that the filtering works? I plan to group the set in control and case groups and later perform MA plot between the groups. I am not sure if that's the right approach and would appreciate any suggestions.

This is what I tried

filter=colnames(g123)[g123@phenoData@data$"Source"=="CASE"] dim(filter) NULL

The source column in phenodata describes the sample status (case or control), and I would like to filter the samples based on their status.

This is my first post, and would apologize in advance for any missing information.

Thanks

ADD REPLY
0
Entering edit mode

Try length(filter)

ADD REPLY
0
Entering edit mode

Thanks for the response Kevin.

I did try length(filter), and yet the out seems to be empty.

ADD REPLY
0
Entering edit mode

What are the contents of g123@phenoData@data$"Source"?

ADD REPLY
0
Entering edit mode

Here are the contents of the g123@phenoData@data$"Source".

[1]  CASE     CASE     CONTROL  CASE     CASE     CASE     CASE     CONTROL  CONTROL
[10]  CONTROL  CASE     CONTROL  CONTROL  CASE     CASE     CASE     CASE     CASE   
[19]  CASE     CONTROL  CONTROL  CONTROL  CONTROL  CONTROL  CASE     CONTROL  CASE   
[28]  CONTROL  CASE     CASE     CONTROL  CONTROL  CASE     CASE     CASE     CONTROL
[37]  CASE     CONTROL  CONTROL  CASE     CASE     CASE     CASE     CASE     CONTROL
[46]  CASE     CONTROL  CASE     CONTROL  CASE     CONTROL  CONTROL  CONTROL  CASE   
[55]  CASE     CONTROL  CASE     CONTROL  CASE     CONTROL  CONTROL  CONTROL  CASE   
[64]  CONTROL  CONTROL  CASE   
Levels:  CASE  CONTROL

I just noticed that this column (source) is factorised, do you think that's the reason I cannot filter out the samples fro gset?

ADD REPLY
0
Entering edit mode

Possibly. Try encoding as characters with as.character()

ADD REPLY
0
Entering edit mode

Well I did manage to convert the phenoData of g123 into character using 'lapply' function. And still I cannot filter gset based on the column source in the phenodata.

Any more suggestions.

My gset is a LumiBatch object can that be an issue?

Thanks for your response again.

ADD REPLY
0
Entering edit mode
22 months ago
bennikkel • 0

Hi Davide,

I adapted Kevin's method. I think it will help. Notice that I'm using the tidyverse package.

filter_terms <- gset@phenoData@data %>%
  filter(str_detect(samples collection:ch1, 'on the 1st day of MI (admission)|N/A')) %>% 
  dplyr::select(samples collection:ch1) %>%
  as_vector()

filter <- gset@phenoData@data[gset@phenoData@data$samples collection:ch1 %in% filter_terms, ] %>% 
  rownames()

gset_filter <- gset[,filter]

For the dataset I used, I needed the row names from the phenoData to filter the column names of the expression data. This is why I used rownames() instead of colnames(gset) like Kevin suggested.

This is a weird combination of tidyverse and base R so it may not be perfect but it definitely worked for me.

Ben

ADD COMMENT

Login before adding your answer.

Traffic: 2034 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6