Question

Downloading a Mus musculus gene dataset from Ensembl BioMart

1

9.8 years ago

vigneshprbh37 ▴ 30

I have been trying to download a dataset of Mus musculus gene coding regions from Ensembl BioMart to serve as a reference for my query sequences.

The problem is that no matter how many times I try, the file is only partially downloaded, as is evident from the error message shown by my archive manager.

Has anyone faced a similar predicament and, if so, are there any suggestions to overcome this problem?

blast ensembl mus musculus datasets • 4.4k views

updated 2.6 years ago by Ram 44k • written 9.8 years ago by vigneshprbh37 ▴ 30

0

Are you behind a proxy server?

This is likely a local networking issue, but if you post the list of genes then it's likely that one of us can just post the compressed FASTA file somewhere for you.

9.8 years ago by Devon Ryan 104k

0

Yes, I am connecting through a proxy.

I needed the following Attributes

Ensembl Gene ID
Ensembl Transcript ID
Coding sequence
Description
Associated Gene Name
Ensembl Protein ID
CDS Length

which I selected in BioMart.

updated 2.6 years ago by Ram 44k • written 9.8 years ago by vigneshprbh37 ▴ 30

Ram · Answer 1 · 2015-01-19

2

9.8 years ago

Emily 24k

This is a known problem. With large queries, BioMart is likely to lose its connection with you partway through the download, leaving you with only a partial dataset. There are a couple of solutions. The easiest is to download the data files from the Ensembl FTP site. If you need something specific that is not on the FTP site, you can have BioMart email you the results rather than downloading them directly. BioMart then doesn't have to stay connected to you during the query; it works internally and sends you the results when they are ready.
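One way to confirm whether a download was cut short, without relying on an archive manager, is to decompress it from start to finish. Here is a minimal Python sketch, assuming the results were saved as a gzip file; the helper name `gzip_is_complete` is made up for illustration.

```python
import gzip
import zlib


def gzip_is_complete(path, chunk_size=1 << 20):
    """Return True if the gzip file at `path` decompresses to the end without error."""
    try:
        with gzip.open(path, "rb") as handle:
            # Read and discard the data in chunks so large files do not fill memory.
            while handle.read(chunk_size):
                pass
    except (EOFError, OSError, zlib.error):
        # EOFError: the stream ended early, which is typical of a truncated download.
        # OSError / zlib.error: not a gzip file at all, or corrupted data.
        return False
    return True
```

Calling `gzip_is_complete("results.txt.gz")` (a placeholder file name) on each attempt tells you whether that attempt is worth keeping. It only checks that the compressed stream is intact, not that the query returned every row you expected.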

updated 2.6 years ago by Ram 44k • written 9.8 years ago by Emily 24k

0

I need the following Attributes

Ensembl Gene ID
Ensembl Transcript ID
Coding sequence
Description
Associated Gene Name
Ensembl Protein ID
CDS Length

Can I get it from the ftp site?

updated 2.6 years ago by Ram 44k • written 9.8 years ago by vigneshprbh37 ▴ 30

0

You can get the CDS sequence from the FTP site. This README summarises what is in the header. You'll need to get the description, gene name, protein ID and CDS length from elsewhere: BioMart with results sent to you via email is probably your best bet.
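As a rough sketch of working with the FTP file in Python: once you have the sequences, CDS length can also be computed directly, since it is just the number of bases in each record. The parser below assumes headers of the form `>ID word key:value ... description:free text`. That layout is an assumption, so check it against the README for your release. `parse_cds_fasta` is a hypothetical helper, not part of any Ensembl tool.

```python
def parse_cds_fasta(lines):
    """Yield one dict per FASTA record: its ID, header fields and CDS length."""
    record = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if record is not None:
                yield record
            # Assumed layout: free-text description, if present, comes last.
            header, _, description = line[1:].partition(" description:")
            tokens = header.split()
            fields = {}
            for token in tokens[1:]:
                key, sep, value = token.partition(":")
                if sep:  # keep only key:value tokens, skip bare words
                    fields[key] = value
            if description:
                fields["description"] = description
            record = {"id": tokens[0], "fields": fields, "cds_length": 0}
        elif record is not None:
            # Sequence lines: add their length to the running total.
            record["cds_length"] += len(line)
    if record is not None:
        yield record
```

To use it, pass any iterable of text lines, for example a file opened with `gzip.open(path, "rt")`, where `path` points at the CDS FASTA file you downloaded. Fields that are not in the header for your release will simply be absent from `fields`.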

updated 2.6 years ago by Ram 44k • written 9.8 years ago by Emily 24k

Ram · Answer 2 · 2015-01-19

Hi,

Unfortunately, the larger your queries, the more likely you are to run into problems like this. The best solution, as suggested by Emily, is to keep all the data locally. I ran into the same problem several times, which is why I wrote a Python package that downloads all the data Ensembl makes available for a reference genome and automatically stores it in a database. You can find it here.

And here's the code to do what you want. First the importation:

from pyGeno.importation.Genomes import *
importGenome("Mus_musculus.GRCm38.78.tar.gz")

And then:

from pyGeno.Genome import *

ref = Genome(name = "GRCm38.78")

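for trans in ref.iterGet(Transcript) :
  print(trans.gene.id)       # Ensembl Gene ID
  print(trans.id)            # Ensembl Transcript ID
  print(trans.cDNA)          # sequence printed for "Coding sequence"
  print(trans.gene.biotype)  # gene biotype, printed in place of Description
  print(trans.gene.name)     # Associated Gene Name
  # Guard assumes transcripts without a protein expose None here
  if trans.protein is not None :
    print(trans.protein.id)  # Ensembl Protein ID
  print(len(trans.cDNA))     # CDS Length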

You can find the datawrap (package) for Mus musculus here.

Hope that helps

Cheers