That is definitely a BioMart task. No way screen-scraping is going to work here.
On the Ensembl site, click BioMart, then design your query in the query builder. Once you are happy with the query, you can export it as a Perl script. You can then modify the Perl script to plug in different accessions on the fly, or simply define a filter for gene IDs via the web interface.
Edit:
Did a little experiment using the biomaRt package in R, and it looks like it is feasible to download the data if the query is crafted carefully. Here I am just taking 10 genes. The query time depends mostly on the number of variants per gene, which can vary very broadly. At this rate, downloading all variants for 1000 genes would take roughly 2 hours. The advantage of this approach is that it requires little if any parsing and generates less network traffic. If there are still concerns about overusing the server, one could break the whole job into even smaller packages and spread them over a few hours.
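For crafting the query, the attribute and filter names available in this mart can be listed directly from R; a quick look-up along these lines (just the standard listAttributes/listFilters calls from biomaRt, the "variation" search pattern is only an example) is enough to find the variant-related attributes used below:

library(biomaRt)

mart <- useEnsembl(biomart = "ensembl", dataset = "hsapiens_gene_ensembl")

# Browse the available attributes and filters for this dataset
head(listAttributes(mart))
head(listFilters(mart))

# Narrow down to variant-related attribute names
grep("variation", listAttributes(mart)$name, value = TRUE)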
library(biomaRt)

ensembl.mart <- useEnsembl(biomart = "ensembl")
d <- useDataset(mart = ensembl.mart, dataset = "hsapiens_gene_ensembl")

# Fetch all Ensembl gene IDs in the dataset
system.time(gene.ids <- unlist(getBM(mart = d, attributes = c("ensembl_gene_id"))))

# Get variants one gene at a time so as not to overload the server
get.one <- function(gene.id) {
  getBM(mart = d,
        attributes = c("ensembl_gene_id",
                       "variation_name",
                       "minor_allele_freq",
                       "allele"),
        filters = list(ensembl_gene_id = gene.id))
}

res <- NULL
system.time(
  for (i in 1:10) {
    tmp <- get.one(sample(gene.ids, 1))
    print(paste("got", nrow(tmp), "results"))
    if (nrow(tmp) > 0) {
      res <- rbind(res, tmp)
    }
    Sys.sleep(5)  # Pause a while so as not to overuse the server
  }
)

dim(res)
object.size(res)
Output:
[1] "got 10069 results"
[1] "got 3090 results"
[1] "got 7685 results"
[1] "got 4224 results"
[1] "got 3 results"
[1] "got 3124 results"
[1] "got 1 results"
[1] "got 1 results"
[1] "got 1 results"
[1] "got 4816 results"
user system elapsed
2.721 0.265 68.091
> dim(res)
[1] 33014 4
> object.size(res)
3300368 bytes
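To push this to the full list of ~1000 genes without hammering the server, the same query can be run in small chunks with a pause between requests. A rough sketch along those lines, assuming the gene IDs sit in a plain-text file (my_gene_ids.txt, one ID per line, is just a placeholder name) and querying a handful of genes per request instead of one:

library(biomaRt)

mart <- useEnsembl(biomart = "ensembl", dataset = "hsapiens_gene_ensembl")

# Placeholder input: ~1000 Ensembl gene IDs, one per line
my.gene.ids <- readLines("my_gene_ids.txt")

chunk.size <- 50   # genes per request; tune as needed
chunks <- split(my.gene.ids, ceiling(seq_along(my.gene.ids) / chunk.size))

res <- NULL
for (chunk in chunks) {
  tmp <- getBM(mart = mart,
               attributes = c("ensembl_gene_id", "variation_name",
                              "minor_allele_freq", "allele"),
               filters = "ensembl_gene_id",
               values = chunk)
  if (nrow(tmp) > 0) {
    res <- rbind(res, tmp)
  }
  Sys.sleep(30)  # long pause between chunks to stay polite to the server
}

dim(res)

Spreading the chunks out with longer pauses (or running them from a scheduled job) is the simplest way to distribute the load over a few hours.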
Please, please, please don't do web-scraping.