I tried the following two methods to download and read the GSE data, but both failed.
# First method: download the SOFT file to a local directory, then read it in.
> gse1 <- getGEOfile("GSE11724",destdir="F:/GSEfiles/",amount="brief")
File stored at:
F://GSEfiles//GSE11724.soft
> gse1
[1] "F:/GSEfiles//GSE11724.soft"
> ?getGEO
> getGEO(filename="F:/GSEfiles/GSE11724.soft")
Parsing....
Error in readLines(con, lineCounts[1, 1] - 1) : invalid 'n' argument
# Second method: download and parse in one step.
> gse1 <- getGEO("GSE11724",GSEMatrix=TRUE,getGPL=FALSE,destdir="F:/GSEfiles/")
https://ftp.ncbi.nlm.nih.gov/geo/series/GSE11nnn/GSE11724/matrix/
OK
Found 1 file(s)
GSE11724_series_matrix.txt.gz
trying URL 'https://ftp.ncbi.nlm.nih.gov/geo/series/GSE11nnn/GSE11724/matrix/GSE11724_series_matrix.txt.gz'
Content type 'application/x-gzip' length 125545 bytes (122 KB)
downloaded 122 KB
Error in value[[3L]](cond) : duplicate row.names:
AnnotatedDataFrame 'initialize' could not update varMetadata:
perhaps pData and varMetadata are inconsistent?
In addition: Warning message:
closing unused connection 3 (F:/GSEfiles/GSE11724.soft)
I checked the source code and found that getGEO invokes parseGEO, which in turn invokes parseGSE; the source of that function is listed below. However, I can't test the function, because the filegrep function cannot be found.
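One possible reason filegrep "cannot be found", offered here as an assumption rather than a confirmed diagnosis, is that it is an internal (non-exported) function of GEOquery, so it is visible to parseGSE but not from the console. The sketch below looks a function up directly in a package namespace; get_internal is a hypothetical helper name, and whether filegrep exists depends on the installed GEOquery version.

```r
# Hypothetical helper: fetch a function from a package namespace,
# whether or not the package exports it. Returns NULL if the package
# is not installed or the name is not defined in that namespace.
get_internal <- function(pkg, name) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    return(NULL)
  }
  ns <- asNamespace(pkg)
  if (!exists(name, envir = ns, inherits = FALSE)) {
    return(NULL)
  }
  get(name, envir = ns, inherits = FALSE)
}

# Assumption: the installed GEOquery still defines an internal filegrep.
filegrep <- get_internal("GEOquery", "filegrep")
if (is.null(filegrep)) {
  message("filegrep not found in the GEOquery namespace (or GEOquery not installed)")
} else {
  print(filegrep)
}
```

At the console, the shorthand for the same lookup is the triple-colon operator, e.g. GEOquery:::filegrep, which raises an error instead of returning NULL when the name is missing.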
> parseGSE
function (fname, GSElimits = NULL)
{
    gsmlist <- list()
    gpllist <- list()
    GSMcount <- 0
    writeLines("Parsing....")
    con <- fileOpen(fname)
    lineCounts <- filegrep(con, "\\^(SAMPLE|PLATFORM)", chunksize = 10000)
    message(sprintf("Found %d entities...", nrow(lineCounts)))
    close(con)
    con <- fileOpen(fname)
    a <- readLines(con, lineCounts[1, 1] - 1)
    header = parseGeoMeta(a)
    for (j in 1:nrow(lineCounts)) {
        tmp <- strsplit(as.character(lineCounts[j, 2]), " = ")[[1]]
        accession <- tmp[2]
        message(sprintf("%s (%d of %d entities)", accession,
            j, nrow(lineCounts)))
        entityType <- tolower(sub("\\^", "", tmp[1]))
        nLinesToRead <- lineCounts[j + 1, 1] - lineCounts[j,
            1] - 1
        if (j == nrow(lineCounts)) {
            nLinesToRead <- NULL
        }
        if (entityType == "sample") {
            GSMcount = GSMcount + 1
            if (is.null(GSElimits)) {
                gsmlist[[accession]] <- .parseGSMWithLimits(con,
                    n = nLinesToRead)
            }
            else {
                if ((GSMcount >= GSElimits[1]) & (GSMcount <=
                    GSElimits[2])) {
                    gsmlist[[accession]] <- .parseGSMWithLimits(con,
                        n = nLinesToRead)
                }
                else {
                    if (!is.null(nLinesToRead)) {
                        readLines(con, n = nLinesToRead)
                    }
                }
            }
        }
        if (entityType == "platform") {
            gpllist[[accession]] <- .parseGPLWithLimits(con,
                n = nLinesToRead)
        }
    }
    close(con)
    return(new("GSE", header = header, gsms = gsmlist, gpls = gpllist))
}
<environment: namespace:GEOquery>
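To make the logic of parseGSE easier to follow, here is a rough base-R sketch of the same idea: locate the lines that start a SAMPLE or PLATFORM entity, treat everything before the first one as the series header, and cut the rest into one chunk per entity. This is not GEOquery's implementation; split_soft_entities and the toy input are invented for illustration, and the sketch guards the "no entities found" case explicitly, which the listed code does not appear to do before computing lineCounts[1, 1] - 1.

```r
# Hypothetical helper: split SOFT-format lines into a header and
# one character vector per ^SAMPLE / ^PLATFORM entity.
split_soft_entities <- function(lines) {
  starts <- grep("^\\^(SAMPLE|PLATFORM)", lines)
  if (length(starts) == 0L) {
    # No entity markers at all: everything is header, nothing to split.
    return(list(header = lines, entities = list()))
  }
  # Each entity runs up to the line before the next marker (or end of file).
  ends <- c(starts[-1] - 1L, length(lines))
  header <- lines[seq_len(starts[1] - 1L)]
  entities <- Map(function(s, e) lines[s:e], starts, ends)
  # Name each chunk by its accession, i.e. the text after "^TYPE = ".
  names(entities) <- sub("^\\^\\w+ = ", "", lines[starts])
  list(header = header, entities = entities)
}

# Toy input, made up for this example (not the content of GSE11724).
toy <- c(
  "^SERIES = GSE_TOY",
  "!Series_title = toy series",
  "^PLATFORM = GPL_TOY",
  "!Platform_title = toy platform",
  "^SAMPLE = GSM_TOY1",
  "!Sample_title = first sample",
  "^SAMPLE = GSM_TOY2",
  "!Sample_title = second sample"
)

parts <- split_soft_entities(toy)
print(names(parts$entities))

# A file holding only series-level lines yields no entities.
header_only <- split_soft_entities(toy[1:2])
print(length(header_only$entities))
```

If a file written with amount="brief" contains no SAMPLE or PLATFORM markers (an assumption that would need checking against the actual file), the header-only branch is the situation the parser would be facing.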
Because I only want to extract the taxonomy information, I have worked around this problem by scanning the downloaded *.soft files and extracting the data one by one. However, it seems that the parsing function in GEOquery needs an update.
You should carefully read and work through the vignette before attempting to use the package, for your own good. If all you require is the taxid, please see my reply below.
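For reference, the line-scanning workaround mentioned in the question could look something like the sketch below. It assumes the SOFT file stores metadata as "!attribute = value" lines and that the taxonomy attributes have "taxid" in their names (for example !Series_sample_taxid); soft_attributes is a hypothetical helper, and the toy file is invented so the example runs without downloading anything.

```r
# Hypothetical helper: return every "!attribute = value" line whose
# attribute name contains `pattern`, as a two-column data frame.
soft_attributes <- function(path, pattern = "taxid") {
  lines <- readLines(path, warn = FALSE)
  meta <- grep("^!", lines, value = TRUE)            # attribute lines start with "!"
  meta <- meta[grepl(" = ", meta, fixed = TRUE)]     # keep only key/value lines
  key <- sub(" = .*$", "", meta)                     # text before the first " = "
  value <- sub("^.*? = ", "", meta, perl = TRUE)     # text after the first " = "
  keep <- grepl(pattern, key, ignore.case = TRUE)
  unique(data.frame(key = key[keep], value = value[keep],
                    stringsAsFactors = FALSE))
}

# Toy SOFT file, made up for this example (not the content of GSE11724).
toy <- tempfile(fileext = ".soft")
writeLines(c(
  "^SERIES = GSE_TOY",
  "!Series_title = toy series",
  "!Series_sample_taxid = 10090",
  "^SAMPLE = GSM_TOY1",
  "!Sample_taxid_ch1 = 10090",
  "!Sample_organism_ch1 = Mus musculus"
), toy)

res <- soft_attributes(toy)
print(res)
```

On a real download, the call would be soft_attributes("F:/GSEfiles/GSE11724.soft"); passing pattern = "organism" would pull the organism names instead, provided the file uses those attribute names.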