Downloading a Mus musculus gene dataset from Ensembl BioMart
9.9 years ago

I have been trying to download a dataset of Mus musculus gene coding regions from Ensembl BioMart to serve as a reference for my query sequences.

The problem is that no matter how many times I try, the file is only partially downloaded, as is evident from the error message shown by my archive manager.

Has anyone faced a similar problem, and if so, are there any suggestions for overcoming it?

blast ensembl mus musculus datasets • 4.4k views

Are you behind a proxy server?

This is likely a local networking issue, but if you post the list of genes, one of us can probably just post the compressed FASTA file somewhere for you.


Yes, I am connecting through a proxy.

I needed the following attributes:

Ensembl Gene ID
Ensembl Transcript ID
Coding sequence
Description
Associated Gene Name
Ensembl Protein ID
CDS Length

which I selected in BioMart.

9.9 years ago
Emily 24k

This is a known problem. With large queries, BioMart is likely to lose its connection to you partway through the download, leaving you with only a partial dataset. There are a couple of solutions. The easiest is to download the data files from the Ensembl FTP site. If you need something specific that is not on the FTP site, you can have BioMart email you the results rather than downloading them directly. BioMart then doesn't have to stay connected to you while the query runs; it only needs to work internally and send you the results when it is done.
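Whichever route you take, it can help to confirm that the compressed file arrived whole before using it. The following is a minimal sketch using only the Python standard library; the function name `is_complete_gzip` is made up for this illustration, and the check only applies if your download is a gzip file (as the `.fa.gz` files on the FTP site are).

```python
import gzip
import os
import zlib


def is_complete_gzip(path, chunk_size=1 << 20):
    """Return True if the gzip file at `path` decompresses to the end without error."""
    # A zero-byte file decompresses "successfully" to nothing, so reject it explicitly.
    if os.path.getsize(path) == 0:
        return False
    try:
        with gzip.open(path, "rb") as handle:
            # Read in chunks so a large FASTA dump is never held in memory at once.
            while handle.read(chunk_size):
                pass
    except (EOFError, OSError, zlib.error):
        # EOFError: the stream stopped early, typical of an interrupted download.
        # OSError / zlib.error: not a gzip file, or the data is corrupted.
        return False
    return True
```

If this returns False for a freshly downloaded file, the transfer was most likely cut short and the file should be fetched again.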


I need the following attributes:

Ensembl Gene ID
Ensembl Transcript ID
Coding sequence
Description
Associated Gene Name
Ensembl Protein ID
CDS Length

Can I get these from the FTP site?


You can get the CDS sequence from the FTP site. This README summarises what is in the header. You'll need to get the description, gene name, protein ID and CDS length from elsewhere: BioMart with results sent to you via email is probably your best bet.
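As a rough illustration of working with the FTP file, the sketch below reads FASTA records and reports the transcript ID, the gene ID and the sequence length (which, for a CDS file, is the CDS length computed directly from the sequence). The header layout is an assumption: it takes the first token to be the transcript ID and looks for a `gene:` field among the rest, so check it against the README for your release. The function names and the IDs in the usage note are invented for this example.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA text lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def summarise_cds(lines):
    """Return one dict per record with transcript ID, gene ID and sequence length."""
    rows = []
    for header, seq in read_fasta(lines):
        fields = header.split()
        tagged = {}
        for field in fields[1:]:
            # ASSUMPTION: extra header fields look like "key:value".
            if ":" in field:
                key, value = field.split(":", 1)
                tagged.setdefault(key, value)  # keep the first occurrence only
        rows.append({
            "transcript_id": fields[0],
            "gene_id": tagged.get("gene"),  # None if the header has no gene: field
            "cds_length": len(seq),
        })
    return rows
```

For example, passing the lines `>TX1 cds chromosome:GRCm38:1:100:108:1 gene:G1`, `ATGAAA`, `TAA` gives a single row with transcript `TX1`, gene `G1` and a length of 9. For the real file you could pass a handle opened with `gzip.open(path, "rt")`.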

9.9 years ago
Tariq Daouda ▴ 220

Hi,

Unfortunately, the bigger your queries, the more likely you are to run into problems. The best solution, as suggested by Emily, is to keep all the data locally. I ran into the same problem several times, which is why I wrote a Python package that downloads all the data Ensembl makes available for a reference genome and automatically stores it neatly in a database. You can find it here.

And here's the code to do what you want. First the importation:

from pyGeno.importation.Genomes import *
importGenome("Mus_musculus.GRCm38.78.tar.gz")

And then:

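from pyGeno.Genome import *

# Load the genome that was imported in the previous step
ref = Genome(name = "GRCm38.78")

# Print the requested fields for every transcript
# (print() with a single argument works in both Python 2 and 3)
for trans in ref.iterGet(Transcript):
  print(trans.gene.id)
  print(trans.id)
  print(trans.cDNA)
  print(trans.gene.biotype)
  print(trans.gene.name)
  print(trans.protein.id)
  print(len(trans.cDNA))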

You can find the datawrap (package) for mus_musculus here.

Hope that helps

Cheers
