biomaRt: Timeout on getBM().
3.4 years ago

Hi. I want to get the exon annotation for a list of chromosomal intervals (hg19). Here is the code:

getBM_value <- list(
  chromosome_name = bed_df$chromosome_name,
  start = bed_df$start,
  end = bed_df$end
)
mart <- useDataset('hsapiens_gene_ensembl', useMart('ensembl', host="grch37.ensembl.org"))
fupanel_bed_anno <- getBM(attributes = c('chromosome_name', 'exon_chrom_start', 'exon_chrom_end', 
                                         "strand", "ensembl_gene_id","ensembl_exon_id"), 
                          filters = c('chromosome_name', 'start', 'end'),
                          values = getBM_value,
                          mart = mart)

But it returns an error:

Error in curl::curl_fetch_memory(url, handle = handle) : 
  Timeout was reached: [grch37.ensembl.org:80] Operation timed out after 300001 milliseconds with 163661 bytes received

I've tried the mirror argument.

mart <- useEnsembl(biomart='ensembl', dataset='hsapiens_gene_ensembl', mirror = "uswest", GRCh = 37)

Warning message:
In useEnsembl(biomart = "ensembl", dataset = "hsapiens_gene_ensembl",  :
  version or GRCh arguments can not be used together with the mirror argument.', 
                'We will ignore the mirror argument and connect to main Ensembl site.

fupanel_bed_anno <- getBM(attributes = c('chromosome_name', 'exon_chrom_start', 'exon_chrom_end', 
                                         "strand", "ensembl_gene_id","ensembl_exon_id"), 
                          filters = c('chromosome_name', 'start', 'end'),
                          values = getBM_value,
                          mart = mart)

Error in curl::curl_fetch_memory(url, handle = handle) : 
  Timeout was reached: [grch37.ensembl.org:443] Operation timed out after 300001 milliseconds with 323405 bytes received
r biomart ensembl • 11k views

Tagging: Emily_Ensembl


Tagging: Mike Smith


Trying a mirror is a good idea, but look at the warning: the mirror argument was ignored because you also passed GRCh = 37, so the second query still went to grch37.ensembl.org rather than to the uswest mirror.


How big is your bed file?


232 kB, with a total of 7,983 rows.


And how big are the regions in the bed file?


The BED file covers the exons of 500 genes. Each row spans about 100 bp on average.

7 months ago

When calling getBM, you can provide a CURLHandle with custom settings. This worked for me when I had a similar timeout problem:

getBM( ... , curl = curl::new_handle(timeout_ms=3600000))

Unfortunately, not all biomaRt commands accept a curl handle.


Many elements of biomaRt will now respect the setting applied in options('timeout'), so you can use that mechanism to try and adjust this.

I'm pretty sure that curl argument is no longer used and should be deprecated.
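
As a minimal sketch of the options('timeout') route: the helper below (with_timeout is a hypothetical name, not part of biomaRt) raises R's global timeout for a single expression and restores the previous value afterwards. Whether a particular biomaRt call honours this option depends on the installed version, so treat it as something to try rather than a guaranteed fix.

```r
# Hypothetical helper: evaluate `expr` with R's global "timeout" option
# (in seconds) temporarily set to `seconds`, then restore the old value.
with_timeout <- function(seconds, expr) {
  old <- options(timeout = seconds)   # options() returns the previous setting
  on.exit(options(old), add = TRUE)   # restore even if `expr` throws an error
  force(expr)                         # `expr` is lazy, so it runs under the new timeout
}

# Usage (needs biomaRt and a network connection, so it is left commented out):
# result <- with_timeout(3600, getBM(attributes = ..., filters = ...,
#                                    values = ..., mart = mart))
```

Setting the option once at the top of a script with options(timeout = 3600) has the same effect without the restore step.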

3.4 years ago
Emily 24k

The problem is that your query is just too big. We recommend a maximum of 500 regions in our web interface because of the limitations of the server and it's the same server for both the web interface and the R interface.

You could chunk your query in biomaRt, splitting your list up or even running each region as a single query.
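
A rough sketch of that chunking, under some assumptions: split_into_chunks and query_in_chunks are hypothetical helper names, the chunk size of 200 is arbitrary (anything at or below the 500-region recommendation should be in the same spirit), and the usage example assumes the chromosomal_region filter, which takes "chr:start:end" strings so that each region keeps its own coordinates within a chunk.

```r
# Split a vector into consecutive chunks of at most `size` elements.
split_into_chunks <- function(x, size) {
  unname(split(x, ceiling(seq_along(x) / size)))
}

# Run `query_fun` once per chunk, pause between requests, and row-bind
# the de-duplicated results. `query_fun` is any function that takes one
# chunk and returns a data frame, e.g. a thin wrapper around getBM().
query_in_chunks <- function(x, size, query_fun, pause = 1) {
  results <- lapply(split_into_chunks(x, size), function(chunk) {
    out <- query_fun(chunk)
    Sys.sleep(pause)   # be gentle with the shared server
    out
  })
  unique(do.call(rbind, results))
}

# Usage (needs biomaRt and a network connection, so it is left commented out):
# library(biomaRt)
# mart <- useEnsembl(biomart = "ensembl", dataset = "hsapiens_gene_ensembl", GRCh = 37)
# regions <- paste(bed_df$chromosome_name, bed_df$start, bed_df$end, sep = ":")
# exons <- query_in_chunks(regions, 200, function(chunk) {
#   getBM(attributes = c("chromosome_name", "exon_chrom_start", "exon_chrom_end",
#                        "strand", "ensembl_gene_id", "ensembl_exon_id"),
#         filters = "chromosomal_region", values = chunk, mart = mart)
# })
```

If one chunk still times out, the same helper can be rerun with a smaller size, down to one region per query.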

Another option is to use the REST API overlap endpoint, which you can script around in your preferred programming language. This may be better than BioMart, since BioMart will be getting you all the genes that overlap your regions and all the exons of all of those genes. If you just want the exons that overlap your loci, the REST endpoint will do that for you.
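
For the REST route, here is a hedged sketch in R using the httr and jsonlite packages. It assumes the GRCh37 REST server at https://grch37.rest.ensembl.org and the overlap/region endpoint with feature=exon. overlap_url and fetch_overlapping_exons are hypothetical helper names, and the columns of the returned data frame are whatever the server sends back, so inspect them before relying on any particular field.

```r
# Build the overlap URL for one region (assumed GRCh37 REST server).
overlap_url <- function(chrom, start, end, feature = "exon",
                        server = "https://grch37.rest.ensembl.org") {
  sprintf("%s/overlap/region/human/%s:%d-%d?feature=%s",
          server, chrom, as.integer(start), as.integer(end), feature)
}

# Fetch the features overlapping one region and parse the JSON response.
# Requires the httr and jsonlite packages and a network connection.
fetch_overlapping_exons <- function(chrom, start, end) {
  resp <- httr::GET(overlap_url(chrom, start, end),
                    httr::content_type("application/json"),
                    httr::timeout(60))
  httr::stop_for_status(resp)   # turn HTTP errors into R errors
  jsonlite::fromJSON(httr::content(resp, as = "text", encoding = "UTF-8"))
}

# Usage (one request per BED row; left commented out because it hits the network):
# exon_list <- Map(fetch_overlapping_exons,
#                  bed_df$chromosome_name, bed_df$start, bed_df$end)
```

With several thousand rows this means several thousand small requests, so it is worth adding a short pause between calls and saving results as you go.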


Emily,

I had a similar issue but my query is not large:

Failed <- getBM(attributes = c("ensembl_transcript_id", "ensembl_gene_id", "transcript_tsl"), mart = ensembl)

Error in curl::curl_fetch_memory(url, handle = handle) : Timeout was reached: [dec2021.archive.ensembl.org:443] Operation timed out after 300000 milliseconds with 9960752 bytes received


Emily,

I found this thread after running into a similar issue. I am trying to find all refSNP IDs (rs numbers) for a set of immune system genes (the set contains 2,960 Ensembl gene (ENSG) IDs), and I am hitting the timeout as well.

I have tried splitting the query into smaller and smaller sets of ENSG IDs, to the point where a for loop sends only one or two at a time. I get through about 2 ENSG IDs and then the 3rd gives me the timeout (see the code snippet below).

What should I do?

mart.snp <- useMart("ENSEMBL_MART_SNP", "hsapiens_snp", host = "https://grch37.ensembl.org")

Immune.rs_genes <- data.frame(matrix(ncol = 3, nrow = 0))
colnames(Immune.rs_genes) <- c("refsnp_id", "ensembl_gene_stable_id", "associated_gene")

chunks <- function(x, n) split(x, cut(seq_along(x), n, labels = FALSE)) # Function that cuts vector x into n chunks
GO_immune_ENGID <- read.csv("GO_Immune_ENGIDs.csv") # a single-column CSV with 2960 ENSG IDs
Immune.rs_sets <- chunks(GO_immune_ENGID$x, 1500) # creates a list of 1500 vectors containing about 2 ENSG IDs each

for (Sect in seq_along(Immune.rs_sets)) {

  # [[Sect]] extracts the vector of IDs; [Sect] would pass a one-element list instead
  Small_set <- getBM(attributes = c("refsnp_id", "ensembl_gene_stable_id", "associated_gene"), filters = "ensembl_gene", values = Immune.rs_sets[[Sect]], mart = mart.snp, verbose = TRUE)

  Immune.rs_genes <- rbind(Immune.rs_genes, Small_set)

  Sys.sleep(2) # Sleep for 2 seconds to avoid spamming the query system; I have tried with and without this.

  print(Sect)

}

Based on the verbose printout, it queries only 2 ENSG IDs at a time. This is the output from the code above:

"""

Cache found

[1] 1

<?xml version='1.0' encoding='UTF-8'?><!DOCTYPE Query><Query  virtualSchemaName = 'default' uniqueRows = '1' count='0' datasetConfigVersion='0.6' header='1' formatter='TSV' requestid='biomaRt'> <Dataset name = 'hsapiens_snp'><Attribute name = 'refsnp_id'/><Attribute name = 'ensembl_gene_stable_id'/><Attribute name = 'associated_gene'/><Filter name = "ensembl_gene" value = "ENSG00000100345,ENSG00000134516" /></Dataset></Query>
Error in curl::curl_fetch_memory(url, handle = handle) :
  Timeout was reached: [grch37.ensembl.org:443] Operation timed out after 300000 milliseconds with 53069 bytes received

"""

The [1] is clearly the print output, so I am not sure what is going on. I checked my internet speed and it is 900 Mbps on my end.
