getGEO and getGEOfile error
2
0
Entering edit mode
7.7 years ago
pengchy ▴ 450

I have tried the following two methods to download and read the GSE data, but failed at the end.

#first download the SOFT file to local, then read in.

> gse1 <- getGEOfile("GSE11724",destdir="F:/GSEfiles/",amount="brief")
File stored at: 
F://GSEfiles//GSE11724.soft
> gse1
[1] "F:/GSEfiles//GSE11724.soft"
> ?getGEO
> getGEO(filename="F:/GSEfiles/GSE11724.soft")
Parsing....

Error in readLines(con, lineCounts[1, 1] - 1) : invalid 'n' argument

 

#the second method is download and parsing:

> gse1 <- getGEO("GSE11724",GSEMatrix=TRUE,getGPL=FALSE,destdir="F:/GSEfiles/")
https://ftp.ncbi.nlm.nih.gov/geo/series/GSE11nnn/GSE11724/matrix/
OK
Found 1 file(s)
GSE11724_series_matrix.txt.gz
trying URL 'https://ftp.ncbi.nlm.nih.gov/geo/series/GSE11nnn/GSE11724/matrix/GSE11724_series_matrix.txt.gz'
Content type 'application/x-gzip' length 125545 bytes (122 KB)
downloaded 122 KB

Error in value[[3L]](cond) : duplicate row.names: 
  AnnotatedDataFrame 'initialize' could not update varMetadata:
  perhaps pData and varMetadata are inconsistent?
In addition: Warning message:
closing unused connection 3 (F:/GSEfiles/GSE11724.soft)

I have checked the source code and found that getGEO will invoke parseGEO, which will invoke parseGSE, the source of this function listed below. However, I cann't test the function, because the filegrep function can not be found.

> parseGSE
function (fname, GSElimits = NULL) 
{
    gsmlist <- list()
    gpllist <- list()
    GSMcount <- 0
    writeLines("Parsing....")
    con <- fileOpen(fname)
    lineCounts <- filegrep(con, "\\^(SAMPLE|PLATFORM)", chunksize = 10000)
    message(sprintf("Found %d entities...", nrow(lineCounts)))
    close(con)
    con <- fileOpen(fname)
    a <- readLines(con, lineCounts[1, 1] - 1)
    header = parseGeoMeta(a)
    for (j in 1:nrow(lineCounts)) {
        tmp <- strsplit(as.character(lineCounts[j, 2]), " = ")[[1]]
        accession <- tmp[2]
        message(sprintf("%s (%d of %d entities)", accession, 
            j, nrow(lineCounts)))
        entityType <- tolower(sub("\\^", "", tmp[1]))
        nLinesToRead <- lineCounts[j + 1, 1] - lineCounts[j, 
            1] - 1
        if (j == nrow(lineCounts)) {
            nLinesToRead <- NULL
        }
        if (entityType == "sample") {
            GSMcount = GSMcount + 1
            if (is.null(GSElimits)) {
                gsmlist[[accession]] <- .parseGSMWithLimits(con, 
                  n = nLinesToRead)
            }
            else {
                if ((GSMcount >= GSElimits[1]) & (GSMcount <= 
                  GSElimits[2])) {
                  gsmlist[[accession]] <- .parseGSMWithLimits(con, 
                    n = nLinesToRead)
                }
                else {
                  if (!is.null(nLinesToRead)) {
                    readLines(con, n = nLinesToRead)
                  }
                }
            }
        }
        if (entityType == "platform") {
            gpllist[[accession]] <- .parseGPLWithLimits(con, 
                n = nLinesToRead)
        }
    }
    close(con)
    return(new("GSE", header = header, gsms = gsmlist, gpls = gpllist))
}
<environment: namespace:GEOquery>
geoquery r bioconductor getGEO • 4.7k views
ADD COMMENT
0
Entering edit mode

Because I only want to extract the taxonomy information, I have resolved this problem by scan the downloaded files *.soft, and extract the data one by one. However, it seems that the parsing function from GEOquery need update.

ADD REPLY
1
Entering edit mode

You should carefully read/look at/do the vignette before attempting to use it for your own good. If all you require is the taxid, please see my reply below.

ADD REPLY
2
Entering edit mode
7.7 years ago
theobroma22 ★ 1.2k

Ok, I see. At first I thought you were trying to call a dataset from the GEOquery library. But this is incorrect. So, you can't use the code you provided because that is to call the data that comes with the package library, as it is stated in the vignette.

You should use:

gse1 = getGEO("GSE11724", GSEMatrix = FALSE)

This will take awhile to download all 26 files.

Then,

> show(gse1)

An object of class "GSE" Slot "header": $contact_address [1] "9 Cambridge Center"

$contact_city [1] "Cambridge"

$contact_country [1] "USA"

$contact_department [1] "Genome Technology Core"

$contact_email [1] "sgupta@wi.mit.edu"

$contact_institute [1] "Whitehead Institute"

$contact_name [1] "Whitehead,,Institute"

$contact_phone [1] "617-324-0339"

$contact_state [1] "MA"

$contact_zip/postal_code [1] "02141"

$contributor [1] "Richard,A,Young"

$email [1] "geo@ncbi.nlm.nih.gov"

$geo_accession [1] "GSE11724"

$institute [1] "NCBI NLM NIH"

$last_update_date [1] "Dec 07 2016"

$name [1] "Gene Expression Omnibus (GEO)"

$overall_design [1] "ChIP-seq in murine embryonic stem cells for Oct4, Sox2 (2 runs), Nanog (2 runs), Tcf3 (2 r...

$platform_id [1] "GPL9185"

$platform_taxid [1] "10090"

$pubmed_id [1] "18692474"

$relation [1] "SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRP000712"
[2] "BioProject: https://www.ncbi.nlm.nih.gov/bioproject/PRJNA106023"

$sample_id [1] "GSM307137" "GSM307138" "GSM307139" "GSM307140" "GSM307141" "GSM307142" "GSM307143" "GSM307144" "GSM307145" [10] "GSM307146" "GSM307147" "GSM307148" "GSM307149" "GSM307150" "GSM307151" "GSM307152" "GSM307153" "GSM307154" [19] "GSM307155" "GSM307156" "GSM307157" "GSM307158" "GSM307159" "GSM307160" "GSM307161"

$sample_taxid [1] "10090"

$status [1] "Public on Jul 30 2008"

$submission_date [1] "Jun 09 2008"

$summary [1] "MicroRNAs (miRNAs) are crucial for normal embryonic stem (ES) cell self-renewal and cellular differentiation, but how miRNA gene expression is controlled by the key transcriptional regulators of ES cells has not been established... [2] ""
[3] "Keywords: ChIP-seq analysis ...

$supplementary_file [1] "ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByStudy/sra/SRP/SRP007/SRP000712" [2] "ftp://ftp-trace.ncbi.nlm.nih.gov/sra/sra-instant/reads/ByStudy/sra/SRP/SRP000/SRP000712" [3] "ftp://ftp.ncbi.nlm.nih.gov/pub/geo/DATA/supplementary/series/GSE11724/GSE11724_RAW.tar" ......... ......... .........

It's unfortunate but the submitter may have submitted the pData and varMetaData with errors. I've found that this is not uncommon. So, you might have to do what you want to do outside of the GEOquery package and handle the data yourself.

Hope this helps.

ADD COMMENT
1
Entering edit mode
7.7 years ago
theobroma22 ★ 1.2k

In the first section you seem to be missing a forward slash in the filename when you called the getGEO function. Did you try using filename=gse1? Is the file in that directory and did you set the working directory to that location?

ADD COMMENT
0
Entering edit mode

Thank you, I have tried as you suggested, the error remain.

ADD REPLY

Login before adding your answer.

Traffic: 1819 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6