Question

GenomicDataCommons request timeouts with cases() %>% ... %>% results_all()

2

6.4 years ago

mk ▴ 310

I've been experimenting with the GenomicDataCommons package to query the GDC API. Requests that return more than a certain number of records time out, and there doesn't seem to be a direct way around this using the piped syntax. Has anyone else had luck with this?

There is a results(size = n) method, but its syntax seems to allow only the first n records to be accessed.

Here is an example query (should return 400-500 records):

proj <- 'TCGA-COAD'
case_data <- cases() %>%
  GenomicDataCommons::filter(~ project.project_id == proj) %>%
  GenomicDataCommons::expand('diagnoses') %>%
  results_all() %>%
  as_tibble()

Gives, after a few moments:

Error in is.response(x) : Internal Server Error (HTTP 500).

GenomicDataCommons bioconductor R http • 2.3k views

ADD COMMENT • link updated 5.3 years ago by Biostar 20 • written 6.4 years ago by mk ▴ 310

0

It runs here but there's no output (?):

require(GenomicDataCommons)
require(tibble)
case_data <- cases() %>%
   GenomicDataCommons::filter(~ project.project_id == proj) %>%
   GenomicDataCommons::expand('diagnoses') %>%
   results_all() %>%
   as_tibble()
case_data
# A tibble: 0 x 0


sessionInfo()
R version 3.5.2 (2018-12-20)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.5 LTS

Matrix products: default
BLAS: /usr/lib/atlas-base/atlas/libblas.so.3.0
LAPACK: /usr/lib/atlas-base/atlas/liblapack.so.3.0

locale:
 [1] LC_CTYPE=pt_BR.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=pt_BR.UTF-8    
 [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=pt_BR.UTF-8   
 [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] tibble_2.0.1             GenomicDataCommons_1.6.0 magrittr_1.5            
[4] arm_1.10-1               lme4_1.1-19              Matrix_1.2-15           
[7] MASS_7.3-51.1

ADD REPLY • link 6.4 years ago by Kevin Blighe 89k

0

Thanks @Kevin Blighe, that was a sloppy cut/paste job: I forgot to add the project to the filter() method. Edited above.

ADD REPLY • link 6.4 years ago by mk ▴ 310

0

6.4 years ago

mk ▴ 310

I have a fix for this but it's not very satisfactory. Based on Sean Davis' blog post, I retrieved sequencing data and extracted the case IDs from that, then looped over the case IDs, fetching diagnosis data 50 records at a time. The fact that this worked is a mystery to me, since the query against the files() endpoint returns even more records than the query posted above (each case may contain multiple samples).

First get the files, and their associated cases:

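library(GenomicDataCommons)
library(dplyr)   # bind_rows()
library(tibble)  # as_tibble()

proj <- 'TCGA-COAD'
# Gene-expression count files for primary tumors in the project,
# expanded so each file record carries its associated cases and samples
tm_ge_files <- files() %>%
  GenomicDataCommons::filter(~ cases.samples.sample_type == 'Primary Tumor' &
                               cases.project.project_id == proj &
                               analysis.workflow_type == 'HTSeq - Counts') %>%
  GenomicDataCommons::expand(c('cases', 'cases.samples')) %>%
  results_all() %>%
  as_tibble()
# One row per (file, case) pair; file_id records which file each case came from
tm_cases <- bind_rows(tm_ge_files$cases, .id = 'file_id')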

Now get the diagnoses:

# Fetch diagnoses in batches of 50 case IDs so each request stays small.
# unique() drops repeats, since a case can appear once per file.
case_ids <- unique(tm_cases$case_id)
# Split into consecutive, non-overlapping batches (the last one may be shorter)
batches <- split(case_ids, ceiling(seq_along(case_ids) / 50))
clin <- do.call(rbind, lapply(batches, function(ids) gdc_clinical(ids)$diagnoses))

ADD COMMENT • link 6.4 years ago by mk ▴ 310

1

Yes, I was just about to say that I reproduced the error and that you should report it on the Bioconductor forum (and link back to this thread), where Sean Davis may pick it up more quickly: https://support.bioconductor.org/t/Latest/

ADD REPLY • link 6.4 years ago by Kevin Blighe 89k

1

Ok, I posted a link on the Bioconductor forum. If the answer gets posted there, I'll update this thread.

ADD REPLY • link 6.4 years ago by mk ▴ 310

0

Thanks - I'm a user there too, but with a much lower reputation score: https://support.bioconductor.org/u/16406/

ADD REPLY • link 6.4 years ago by Kevin Blighe 89k

score 5 · Accepted Answer · 2019-01-21

I should clean up the documentation, but results_all() is a convenience wrapper that is not too smart in that it simply tries to return all results in one trip to the server. This can fail for multiple reasons related to the size of result sets. The better approach (and the only one in the case of large result sets) is to page through the results:

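library(GenomicDataCommons)
library(dplyr)   # bind_rows()
library(tibble)  # as_tibble()

proj <- 'TCGA-COAD'
query <- cases() %>%
    GenomicDataCommons::filter(~ project.project_id == proj) %>%
    GenomicDataCommons::expand('diagnoses')
# Total number of matching records; qualified because dplyr also exports count()
n_records <- query %>% GenomicDataCommons::count()
size <- 50
# Request one page of `size` records per offset instead of everything at once.
# `from` is the record offset. Verify whether your API version counts from 0 or 1;
# for a zero-based offset use seq(0, n_records - 1, size).
reslist <- lapply(seq(1, n_records, size), function(start) {
    query %>%
        results(size = size, from = start) %>%
        as_tibble()
})
# Stack the pages into a single table
case_data <- bind_rows(reslist)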
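
To make the paging pattern above easier to check without hitting the server, here is a minimal sketch in base R. fetch_all_pages, fetch_page, and the first argument are hypothetical names, not part of GenomicDataCommons. fetch_page stands in for a call such as query %>% results(size = size, from = from), and the mock endpoint assumes zero-based offsets.

```r
# Hypothetical helper: walk through `total` records in pages of `size`.
# `first` is the offset of the first record (0 or 1, depending on the API).
fetch_all_pages <- function(fetch_page, total, size = 50, first = 0) {
  if (total == 0) return(data.frame())
  starts <- seq(first, first + total - 1, by = size)
  pages <- lapply(starts, function(from) fetch_page(from = from, size = size))
  do.call(rbind, pages)
}

# Mock endpoint over an in-memory table of 120 records (zero-based offsets).
records <- data.frame(id = sprintf("case-%03d", 1:120), stringsAsFactors = FALSE)
mock_fetch <- function(from, size) {
  idx <- seq_len(nrow(records))
  records[idx > from & idx <= from + size, , drop = FALSE]
}

all_records <- fetch_all_pages(mock_fetch, total = nrow(records), size = 50)
```

With a real query you would pass something like function(from, size) query %>% results(size = size, from = from) %>% as_tibble(), set first to whatever offset base the API expects, and possibly swap rbind() for bind_rows() if pages can differ in their columns.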

Unfortunately, the size parameter really requires trial-and-error to find the largest "working" setting since the results can vary quite significantly in volume. Instead, I usually just choose a smallish number like 50 or so and wait a few extra seconds. These calls can, in theory, be parallelized using something like BiocParallel to get really fancy (and introduce complexity).
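
The trial-and-error on size could also be automated. The sketch below is only an illustration under stated assumptions, not something the package provides: fetch_with_shrinking_size is a hypothetical wrapper that retries a page with half the size whenever the call errors, and flaky_fetch is a mock that fails for sizes above 60 to imitate a server-side limit.

```r
# Hypothetical wrapper: try a page at the requested size and halve the size
# each time the call signals an error (for example an HTTP 500).
fetch_with_shrinking_size <- function(fetch_page, from, size, min_size = 1) {
  min_size <- max(1, min_size)  # guard against looping forever at size 0
  while (size >= min_size) {
    page <- tryCatch(fetch_page(from = from, size = size),
                     error = function(e) NULL)
    if (!is.null(page)) return(list(page = page, size = size))
    size <- size %/% 2
  }
  stop("no page size >= ", min_size, " succeeded at offset ", from)
}

# Mock endpoint that rejects any request for more than 60 records.
flaky_fetch <- function(from, size) {
  if (size > 60) stop("Internal Server Error (HTTP 500)")
  data.frame(offset = from + seq_len(size) - 1)
}

res <- fetch_with_shrinking_size(flaky_fetch, from = 0, size = 400)
```

The size that finally works (res$size) could then be reused for the remaining pages, so the extra failed requests are only paid once.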