I'm working on an R script that downloads gene expression data from GEO, through Bioconductor and the getGEO() function.
These commands download all the 436 samples of the repository, but I'm only interested in 157 of them. Precisely, I'm interested in handling only the "samples collection:ch1" column with values ""on the 1st day of MI (admission)" or "N/A". How can I select this subset?
I tried to download the complete dataset into the gset
variable, and then to apply the following command:
gset_reduced@phenoData@data <- gset@phenoData@data[gset@phenoData@data$"samples collection:ch1"=="on the 1st day of MI (admission)" | gset@phenoData@data$"samples collection:ch1"=="N/A", ]
but this way only the phenoData feature of the gset_reducted variable correctly contained the 157 samples. The other features of the gset_reducted variable, instead, still contain 436 samples.
How can I select my subset of samples right from the beginning?
Here's my working R code:
setwd(".")
options(stringsAsFactors = FALSE)
source("https://bioconductor.org/biocLite.R")
listOfBiocPackages <- c("oligo", "GEOquery", "affyio", "biomaRt", "sva", "pamr", "limma", "BiocParallel", "genefilter", "GO.db")
bioCpackagesNotInstalled <- which( !listOfBiocPackages %in% rownames(installed.packages()) )
cat("package missing listOfBiocPackages[", bioCpackagesNotInstalled, "]: ", listOfBiocPackages[bioCpackagesNotInstalled], "\n", sep="")
# check there's still something left to install
if( length(bioCpackagesNotInstalled) ) {
biocLite(listOfBiocPackages[bioCpackagesNotInstalled])
}
library("oligo")
library("GEOquery")
library("affyio")
library("biomaRt")
library("sva")
library("pamr")
library("limma")
library("BiocParallel")
GSE_code <- "GSE59867"
getGEOSuppFiles(GSE_code)
gset <- getGEO(GSE_code, GSEMatrix =TRUE, getGPL=FALSE)
if (length(gset) > 1) idx <- grep("GPL570", attr(gset, "names")) else idx <- 1
gset <- gset[[idx]]
gset_reduced <- gset
gset_reduced@phenoData@data <- gset@phenoData@data[gset@phenoData@data$"samples collection:ch1"=="on the 1st day of MI (admission)" | gset@phenoData@data$"samples collection:ch1"=="N/A", ]
Can anyone help me?
Thanks!
I did try to filter out the probes as suggested here, but the output turns out to be empty. I think it has to do with the different column names (GSMIDs) in eset and phenodata (Gender, Case/Control etc). Do I have to transpose one of them so that the filtering works? I plan to group the set in control and case groups and later perform MA plot between the groups. I am not sure if that's the right approach and would appreciate any suggestions.
This is what I tried
filter=colnames(g123)[g123@phenoData@data$"Source"=="CASE"] dim(filter) NULL
The source column in phenodata describes the sample status (case or control), and I would like to filter the samples based on their status.
This is my first post, and would apologize in advance for any missing information.
Thanks
Try
length(filter)
Thanks for the response Kevin.
I did try length(filter), and yet the out seems to be empty.
What are the contents of
g123@phenoData@data$"Source"
?Here are the contents of the g123@phenoData@data$"Source".
I just noticed that this column (source) is factorised, do you think that's the reason I cannot filter out the samples fro gset?
Possibly. Try encoding as characters with
as.character()
Well I did manage to convert the phenoData of g123 into character using 'lapply' function. And still I cannot filter gset based on the column source in the phenodata.
Any more suggestions.
My gset is a LumiBatch object can that be an issue?
Thanks for your response again.