I am using this code to extract gene lengths directly from Ensembl:
library(biomaRt)
ensembl = useEnsembl(biomart = "ensembl", dataset = "hsapiens_gene_ensembl")
# transcript and CDS lengths for every transcript of the genes in the count matrix
genelength = getBM(attributes = c('ensembl_gene_id', 'ensembl_transcript_id', 'transcript_length', 'cds_length'), filters = 'ensembl_gene_id', values = rownames(counts), mart = ensembl, useCache = FALSE)
# flag marking the canonical transcript of each gene
gene_canonical_transcript = getBM(attributes = c('ensembl_gene_id', 'ensembl_transcript_id', 'transcript_is_canonical'), filters = 'ensembl_gene_id', values = rownames(counts), mart = ensembl, useCache = FALSE)
# keep only the canonical transcripts and attach their lengths
gene_canonical_transcript_subset = gene_canonical_transcript[!is.na(gene_canonical_transcript$transcript_is_canonical), ]
genelength = merge(gene_canonical_transcript_subset, genelength, by = c("ensembl_gene_id", "ensembl_transcript_id"))
return(genelength)
This, however, doesn't always work, and I really don't know why it only succeeds about 50% of the time. Sometimes it falls back to a mirror site:
Ensembl site unresponsive, trying uswest mirror
sometimes it works perfectly fine without using mirror sites, and sometimes it fails completely, producing this error:
Error in bmRequest(request = request, httr_config = httr_config, verbose = verbose) :
Gateway Timeout (HTTP 504).
What causes this to be so unstable? I mean, the code itself works fine (it returns gene lengths whenever it runs without errors). Is the problem with the package, or with the site itself? Is there any way to avoid this error?
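One thing I was wondering about, purely as a guess, is whether pinning a specific mirror up front would avoid the failover. Something like this is what I had in mind (the mirror choice here is arbitrary, I haven't tested whether it helps):
ensembl = useEnsembl(biomart = "ensembl", dataset = "hsapiens_gene_ensembl", mirror = "useast")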
Thank you.
Out of interest, why are you setting useCache = FALSE? The caching is designed for cases where you've already run a query and then want to run exactly the same thing again on a different day. It should load the results from disk and be much faster (especially if the Ensembl server isn't working). If the version of Ensembl changes, or you ask for slightly different information, then the cache will be ignored automatically, even if useCache = TRUE is set. My hope is that the cache can really help out in some situations without ever getting in the way, so I'm curious why you want to turn it off.
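To make that concrete, here is a rough sketch (the gene ID is just an arbitrary example): the second call below is an identical query, so it should be answered from the local cache rather than going back to the Ensembl server.
library(biomaRt)
ensembl = useEnsembl(biomart = "ensembl", dataset = "hsapiens_gene_ensembl")
# first run: queries the Ensembl server and stores the result in the local cache
res1 = getBM(attributes = c('ensembl_gene_id', 'transcript_length'),
             filters = 'ensembl_gene_id', values = "ENSG00000139618",
             mart = ensembl, useCache = TRUE)
# identical query later: loaded from the cache on disk, no server round trip needed
res2 = getBM(attributes = c('ensembl_gene_id', 'transcript_length'),
             filters = 'ensembl_gene_id', values = "ENSG00000139618",
             mart = ensembl, useCache = TRUE)
Note that useCache = TRUE is already the default, so simply dropping useCache = FALSE from your calls gives this behaviour.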