I've been experimenting with the GenomicDataCommons package to run queries against the GDC API. For some reason, queries that return result sets beyond a certain size time out, and there doesn't seem to be a direct way around this using the piped syntax. Has anyone else had luck with this?
There is a results(size = n) method, but its syntax seems to allow only the first n records to be accessed.
Here is an example query (should return 400-500 records):
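(The original snippet didn't paste in cleanly, so here is a sketch of the sort of query involved; the project id TCGA-LUAD and the expand('diagnoses') call are placeholders rather than my exact query:)

```r
library(GenomicDataCommons)
library(magrittr)

# placeholder query: cases in one project, with diagnosis records expanded
qcases = cases() %>%
    filter(~ project.project_id == 'TCGA-LUAD') %>%  # formula-style filter
    expand('diagnoses')

count(qcases)               # a few hundred records expected
res = results_all(qcases)   # this is where the timeout hits for me
```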
I should clean up the documentation, but results_all() is a convenience wrapper that is not too smart: it simply tries to return all results in one trip to the server, which can fail for multiple reasons related to the size of the result set. The better approach (and the only one for large result sets) is to page through the results:
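A minimal sketch of the paging approach, assuming a hypothetical qcases query like the one above (the project id is a placeholder); the key point is that results() accepts size and from arguments, so each call fetches one page:

```r
library(GenomicDataCommons)
library(magrittr)

qcases = cases() %>%
    filter(~ project.project_id == 'TCGA-LUAD') %>%  # placeholder project
    expand('diagnoses')

page_size = 50
n = count(qcases)                        # total records matching the query
offsets = seq(0, n - 1, by = page_size)  # starting offset of each page

# one trip to the server per page; combine afterwards as needed
pages = lapply(offsets, function(from)
    results(qcases, size = page_size, from = from))
```

How you combine the pages afterwards depends on the fields you requested; each element of pages is the parsed result for one page.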
Unfortunately, the size parameter really requires trial and error to find the largest "working" setting, since the volume of the results can vary quite significantly. In practice, I usually just choose a smallish number like 50 and wait a few extra seconds. These calls can, in theory, be parallelized using something like BiocParallel to get really fancy (and introduce complexity); a sketch of that is just below.
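For completeness, a hedged sketch of the BiocParallel variant, reusing qcases, page_size, and offsets from the snippet above; MulticoreParam is just one choice of backend (use SnowParam on Windows):

```r
library(BiocParallel)

# each worker fetches one page; the work is network-bound,
# so a small number of workers is usually enough
pages = bplapply(offsets,
                 function(from) results(qcases, size = page_size, from = from),
                 BPPARAM = MulticoreParam(workers = 4))
```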
I have a fix for this, but it's not very satisfactory. Following Sean Davis' blog post, I retrieved the sequencing data and extracted the case IDs from it, then looped over the case IDs, fetching diagnosis data 50 records at a time. The fact that this worked is a mystery to me, since the query against the files() endpoint returns even more records than the query posted above (each case may contain multiple samples).
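Roughly, the workaround looked like the sketch below; case_ids stands in for the vector of case UUIDs I pulled out of the files() results (extraction step not shown), so treat the details as illustrative:

```r
library(GenomicDataCommons)
library(magrittr)

# case_ids: character vector of case UUIDs from an earlier files() query;
# fetch diagnosis data 50 cases at a time
chunks = split(case_ids, ceiling(seq_along(case_ids) / 50))

diag_pages = lapply(chunks, function(ids) {
    cases() %>%
        filter(~ case_id %in% ids) %>%  # 'in' filter on the case UUIDs
        expand('diagnoses') %>%
        results(size = length(ids))
})
```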
Yes, I was just about to say that I reproduced the error. You should report it on the Bioconductor support forum (and link back to this thread), where Sean Davis may pick it up more quickly: https://support.bioconductor.org/t/Latest/
It runs here but there's no output (?):
Thanks @Kevin Blighe, sloppy cut/paste job on my part: I forgot to add the project to the filter() method. Edited above.