I'm trying to get the data for the following publication:
http://www.nature.com/nature/journal/v447/n7147/full/nature05886.html
The specific GEO accession I'm trying to retrieve is GSE7606 (it's a sub-series of GSE7615, the full dataset for the paper mentioned). I've used GEOquery to download the supplementary files for the series and extracted them:
require(GEOquery)
gseid <- "GSE7606"
supp.melanoma <- getGEOSuppFiles(gseid)
## manually un-tar/gunzip them
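The manual extraction step could also be scripted. Here is a minimal sketch, assuming the download produced an archive named GSE7606_RAW.tar inside a GSE7606 directory under the working directory (both names are assumptions; adjust them to wherever the files actually landed). `gunzip_base` is a hypothetical helper written with base R only, not a GEOquery function.

```r
# Hypothetical helper: decompress one .gz file using base R only.
gunzip_base <- function(gz_path, dest = sub("\\.gz$", "", gz_path)) {
  con <- gzfile(gz_path, open = "rb")
  on.exit(close(con), add = TRUE)
  out <- file(dest, open = "wb")
  on.exit(close(out), add = TRUE)
  repeat {
    chunk <- readBin(con, what = "raw", n = 1e6)  # read up to ~1 MB at a time
    if (length(chunk) == 0) break
    writeBin(chunk, out)
  }
  invisible(dest)
}

# Assumed location of the downloaded archive; skipped if it is not there.
supp_dir <- "GSE7606"
tarfile  <- file.path(supp_dir, "GSE7606_RAW.tar")
if (file.exists(tarfile)) {
  untar(tarfile, exdir = supp_dir)
  gz_files <- list.files(supp_dir, pattern = "\\.gz$", full.names = TRUE)
  for (f in gz_files) gunzip_base(f)
}
```

The compressed originals are left in place, so the step can be rerun safely.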
Since they are CGH profiles, I'm reading them as such:
require(limma)
datapath <- "/path/to/data/GSE7606/"
filenames <- list.files(datapath, pattern="^GSM.*\\.txt$")
cgh.data <- read.maimages(files=filenames,
path=datapath,
columns=list(G="gMedianSignal", Gb="gBGMedianSignal",
R="rMedianSignal", Rb="rBGMedianSignal"),
annotation=c("Row", "Col","FeatureNum", "ControlType","ProbeName",
"ProbeUID", "SystematicName", "GeneName"),
source='agilent')
I want to segment them for CGH analysis. For whatever reason, the files don't include the chromosomal locations. OK, so I'll get them from the GPL (which, according to the GSE7606 record, is GPL887). Also of note: the supplementary data includes a text file that is supposedly an old version of the GPL annotations for these files, but as shown below it does not load:
# try to get directly from GEO; this works!
gpl887 <- getGEO("GPL887", destdir="./data/GSE7606/")
# try to read from their file; doesn't work!
gpl887.included <- getGEO(filename=paste(datapath, "GPL887_old_annotations.txt", sep="/"))
But their file does not load correctly:
> gpl887.included
An object of class "GPL"
An object of class "GEODataTable"
****** Column Descriptions ******
data frame with 0 columns and 0 rows
****** Data Table ******
data frame with 0 columns and 0 rows
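To check whether the included file really has the SOFT table layout, one option is to look for the table markers directly instead of going through the parser. This is only a sketch: `read_soft_table` is a hypothetical helper, and it assumes the file delimits its table with `!platform_table_begin` and `!platform_table_end` lines, as SOFT platform files do. If those markers are missing, that might explain the empty GEODataTable above.

```r
# Hypothetical helper: pull the tab-delimited table out of a SOFT-style
# platform file by locating the begin/end markers.
read_soft_table <- function(path) {
  lines <- readLines(path, warn = FALSE)
  begin <- grep("^!platform_table_begin", lines, ignore.case = TRUE)
  end   <- grep("^!platform_table_end", lines, ignore.case = TRUE)
  if (length(begin) != 1 || length(end) != 1 || end <= begin + 1) {
    stop("no well-formed platform table found in ", path)
  }
  body <- lines[(begin + 1):(end - 1)]  # header row plus data rows
  read.delim(text = body, header = TRUE, quote = "", comment.char = "",
             stringsAsFactors = FALSE, check.names = FALSE)
}

# Intended use (path taken from the code above, not run here):
# old.annot <- read_soft_table(file.path(datapath, "GPL887_old_annotations.txt"))
```

An error from the helper would point at the file layout rather than at GEOquery.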
Furthermore, I can't match up IDs from nearly half of the probes in the CGH data with annotations from the GPL data:
> ingpl <- cgh.data$gene$ProbeName %in% Table(gpl887)$SPOT_ID
> summary(ingpl)
Mode FALSE TRUE NA's
logical 10295 11858 0
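Since SPOT_ID may not be the column that carries the probe names, a quick diagnostic is to compute the match rate against every column of the GPL table. The sketch below uses toy data standing in for the probe names and for Table(gpl887); `match_rate_by_column` and the toy column names are hypothetical, not part of GEOquery or limma.

```r
# Hypothetical helper: fraction of probe names found in each annotation column.
match_rate_by_column <- function(probe_names, annot) {
  strip <- function(x) gsub("^\\s+|\\s+$", "", as.character(x))  # trim whitespace
  probe_names <- strip(probe_names)
  rates <- vapply(annot, function(col) mean(probe_names %in% strip(col)),
                  numeric(1))
  sort(rates, decreasing = TRUE)
}

# Toy data: stand-ins for the array's probe names and the GPL table.
probes <- c("A_14_P1", "A_14_P2", "A_14_P3", "CtrlSpot")
annot  <- data.frame(ID      = c("1", "2", "3"),
                     SPOT_ID = c("A_14_P1", "A_14_P2", "other"),
                     NAME    = c("A_14_P1", "A_14_P2", "A_14_P3"),
                     stringsAsFactors = FALSE)
match_rate_by_column(probes, annot)

# Probes with no match in a given column, for eyeballing what they are.
setdiff(probes, annot$SPOT_ID)
```

On the real data the same call would take the ProbeName vector and Table(gpl887). Cross-tabulating the unmatched probes against the ControlType column might also show whether some of them are control features.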
I've also tried another GSE that has the same platform, with the same results.
Loading the GSE directly with getGEO does not work either, and the error may point to the same problem:
> data.melanoma <- getGEO("GSE7606", destdir=datadir)
Found 1 file(s)
GSE7606_series_matrix.txt.gz
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 9709k 100 9709k 0 0 16.9M 0 --:--:-- --:--:-- --:--:-- 17.9M
File stored at:
/tmp/Rtmp8NZ4Fj/GPL887.soft
Error in validObject(.Object) :
invalid class “ExpressionSet” object: featureNames differ between assayData and featureData
What am I missing; how can I get the proper chromosomal coordinates for the probes on this chip?
Thanks!
> sessionInfo()
R version 2.14.1 (2011-12-22)
Platform: x86_64-redhat-linux-gnu (64-bit)
locale:
[1] LC_CTYPE=en_US.utf8 LC_NUMERIC=C
[3] LC_TIME=en_US.utf8 LC_COLLATE=en_US.utf8
[5] LC_MONETARY=en_US.utf8 LC_MESSAGES=en_US.utf8
[7] LC_PAPER=C LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.utf8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] sva_3.0.3 mgcv_1.7-13 corpcor_1.6.2
[4] DAVIDQuery_1.14.0 RCurl_1.91-1 bitops_1.0-4.1
[7] GOstats_2.20.0 Category_2.20.0 GEOquery_2.21.9
[10] topGO_2.6.0 SparseM_0.96 GO.db_2.6.1
[13] graph_1.32.0 hgu133a2.db_2.6.3 org.Hs.eg.db_2.6.4
[16] RSQLite_0.11.1 DBI_0.2-5 limma_3.10.3
[19] annotate_1.32.3 AnnotationDbi_1.16.19 gcrma_2.26.0
[22] affy_1.32.1 Biobase_2.14.0 ggplot2_0.9.0
[25] reshape_0.8.4 plyr_1.7.1 ProjectTemplate_0.3-5
[28] testthat_0.6
loaded via a namespace (and not attached):
[1] affyio_1.22.0 BiocInstaller_1.2.1 Biostrings_2.22.0
[4] colorspace_1.1-1 dichromat_1.2-4 digest_0.5.2
[7] evaluate_0.4.1 genefilter_1.36.0 grid_2.14.1
[10] GSEABase_1.16.1 IRanges_1.12.6 lattice_0.20-6
[13] MASS_7.3-17 Matrix_1.0-4 memoise_0.1
[16] munsell_0.3 nlme_3.1-103 preprocessCore_1.16.0
[19] proto_0.3-9.2 RBGL_1.30.1 RColorBrewer_1.0-5
[22] reshape2_1.2.1 scales_0.2.0 splines_2.14.1
[25] stringr_0.6 survival_2.36-12 tools_2.14.1
[28] XML_3.9-4 xtable_1.7-0 zlibbioc_1.0.1
I doubt that this is the problem; on Unix-like systems, an extra forward slash in the path does not matter. In addition I can replicate the issue using the downloaded file.
Ah! You are right. My bad.
Thanks! But yeah, not the culprit. The file is supposed to be in SOFT format, so I compared it against the GEO format specification (http://www.ncbi.nlm.nih.gov/projects/geo/info/soft2.html) and didn't see any glaring problems.
I've edited the post to include a few more things I tried: another GSE, and loading the GSE directly with getGEO (neither works).