Question

Availability of pre-ensembl information in biomart

0


7.2 years ago

longoka ▴ 40

Is anyone aware whether pre-Ensembl information is available through a BioMart query? (I often use the biomaRt package in R to retrieve information.)

Specifically, I'm interested in using biomaRt to retrieve info on the crab-eating macaque, M. fascicularis. But I've been unable to find a "preensembl_" attribute in the long list of attributes, and I'm unsure whether this is simply not possible or I'm looking in the wrong place.

This species is listed in pre-Ensembl (Mfac5.0 pre-Ensembl) and the files can be downloaded, but I'm trying to avoid the extra processing of these files.

Another point regarding the R package biomaRt: the code for accessing a particular mart dataset is shown in the snippet below; note that the dataset specified in this case is 'hsapiens_gene_ensembl'.

library(biomaRt)
mart <- useMart("ensembl")
datasets <- listDatasets(mart)  # table of all datasets in this mart
mart <- useDataset("hsapiens_gene_ensembl", mart)

To the best of my knowledge, pre-Ensembl datasets are not available. I'm wondering whether they are in fact available and I'm just using the wrong nomenclature (e.g., 'mfascicularis_gene_ensembl' should be 'mfac_gene_preensembl')?
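Rather than guessing at a dataset name, one option is to search the table returned by listDatasets() for the species. The sketch below is only illustrative: `find_datasets` is a hypothetical helper (not part of biomaRt), and `toy_datasets` is a made-up stand-in for the real table so the example runs without a network connection. It assumes the table has `dataset` and `description` columns.

```r
# Hypothetical helper: search a listDatasets()-style table for a species,
# matching either the dataset name or its description (case-insensitive).
find_datasets <- function(datasets, pattern) {
  hits <- grepl(pattern, datasets$dataset, ignore.case = TRUE) |
    grepl(pattern, datasets$description, ignore.case = TRUE)
  datasets[hits, , drop = FALSE]
}

# Toy stand-in for listDatasets(mart); the rows are illustrative only.
toy_datasets <- data.frame(
  dataset = c("hsapiens_gene_ensembl", "mmulatta_gene_ensembl"),
  description = c("Human genes", "Macaque genes"),
  stringsAsFactors = FALSE
)

find_datasets(toy_datasets, "mulatta")       # one matching row
find_datasets(toy_datasets, "fascicularis")  # zero rows: not in this table

# With a live connection (requires network access):
# find_datasets(listDatasets(useMart("ensembl")), "fascicularis")
```

An empty result from the live table would suggest the species is not served by that mart under any name, rather than being hidden behind different nomenclature.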

As an alternative, I may just download the gff3 file from NCBI (Mfac5.0 gff) and use the makeTxDbFromGFF function in GenomicFeatures to make the TxDb object:

library(GenomicFeatures)

# 'file' is the path to the downloaded Mfac5.0 GFF3 file
txdb <- makeTxDbFromGFF(file,
                format="gff3",
                dataSource=NA,
                organism=NA,
                taxonomyId=NA,
                chrominfo=NULL,
                miRBaseBuild=NA,
                metadata=NULL)

This works nicely, but unfortunately deviates from all of my other code for several other genomes I'm looking at.

> genes(txdb)
GRanges object with 32733 ranges and 1 metadata column:
             seqnames                 ranges strand |     gene_id
                <Rle>              <IRanges>  <Rle> | <character>
     A1BG NC_022290.1 [ 58937416,  58951837]      - |        A1BG
     A1CF NC_022280.1 [ 85376919,  85454067]      + |        A1CF
    A2ML1 NC_022282.1 [  9098456,   9153857]      + |       A2ML1
  A3GALT2 NC_022272.1 [194630652, 194644767]      + |     A3GALT2
   A4GALT NC_022281.1 [  8344430,   8348390]      + |      A4GALT
      ...         ...                    ...    ... .         ...
   ZYG11A NC_022272.1 [174567376, 174637485]      - |      ZYG11A
   ZYG11B NC_022272.1 [174660062, 174713613]      - |      ZYG11B
      ZYX NC_022274.1 [176564635, 176574909]      + |         ZYX
    ZZEF1 NC_022287.1 [  3946682,   4101428]      - |       ZZEF1
     ZZZ3 NC_022272.1 [149532207, 149664372]      + |        ZZZ3
  -------
  seqinfo: 655 sequences from an unspecified genome; no seqlengths
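To keep downstream code closer to what a biomaRt query returns, the gene ranges could be reshaped into a data frame with BioMart-style column names. This is a minimal sketch, not a tested pipeline: `to_biomart_like` is a hypothetical helper, and it assumes its input looks like `as.data.frame(genes(txdb))`, i.e. has `seqnames`, `start`, `end`, `strand` and `gene_id` columns, with gene symbols in `gene_id` as in the output above. The toy rows reuse the first two genes shown.

```r
# Hypothetical helper: rename columns to BioMart-style attribute names and
# recode strand from "+"/"-" to 1/-1.
to_biomart_like <- function(df) {
  data.frame(
    external_gene_name = as.character(df$gene_id),
    chromosome_name    = as.character(df$seqnames),
    start_position     = df$start,
    end_position       = df$end,
    strand             = ifelse(df$strand == "+", 1L,
                                ifelse(df$strand == "-", -1L, NA_integer_)),
    stringsAsFactors   = FALSE
  )
}

# Toy stand-in for as.data.frame(genes(txdb)), using two rows from above.
toy_genes <- data.frame(
  seqnames = c("NC_022290.1", "NC_022280.1"),
  start    = c(58937416L, 85376919L),
  end      = c(58951837L, 85454067L),
  strand   = c("-", "+"),
  gene_id  = c("A1BG", "A1CF"),
  stringsAsFactors = FALSE
)

to_biomart_like(toy_genes)
```

Note that the sequence names remain RefSeq accessions (NC_...), so they would still need mapping to chromosome names before being compared with Ensembl-derived tables.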

biomaRt pre-ensemble Macaca fascicularis • 2.4k views

ADD COMMENT • link updated 2.2 years ago by Kevin Blighe 88k • written 7.2 years ago by longoka ▴ 40

2


We don't have BioMart for genomes in Pre-Ensembl. These are genomes that have not yet been fully processed and do not have the full file structure, which means we don't have the BioMart tables. The crab-eating macaque is due to appear in the next Ensembl release, which is scheduled for December.

ADD REPLY • link 7.2 years ago by Emily 24k

0


I took a look at all available datasets under ensembl using listDatasets(useMart("ensembl")), and found Macaca mulatta, but this is not the exact species you need.

I have never heard of pre-Ensembl datasets being made available through biomaRt. You may have to go down the manual route and build your own function in R that accepts M. fascicularis gene IDs and returns whatever you want it to return. I presume that you downloaded the GTF here: ftp://ftp.ensembl.org/pub/pre/gtf/macaca_fascicularis/ ?
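As a rough idea of what such a function could look like, here is a minimal base-R sketch. The helper names (`gtf_attribute`, `lookup_genes`) and the toy attribute strings are hypothetical, and it assumes the GTF attribute column uses the usual `key "value";` layout; whether the pre-Ensembl file carries a `gene_name` key is also an assumption.

```r
# Hypothetical helper: extract the value of one key from GTF attribute
# strings of the form: gene_id "X"; gene_name "Y";
gtf_attribute <- function(attributes, key) {
  pattern <- paste0("(^|; ?)", key, " \"([^\"]*)\"")
  matches <- regmatches(attributes, regexec(pattern, attributes))
  # Each match holds: full match, separator group, value group.
  vapply(matches,
         function(m) if (length(m) == 3) m[3] else NA_character_,
         character(1))
}

# Hypothetical helper: return annotation rows for the requested gene IDs,
# in the order requested (unknown IDs give a row of NAs).
lookup_genes <- function(ids, annot) {
  annot[match(ids, annot$gene_id), , drop = FALSE]
}

# Toy attribute strings; the IDs and names are made up for illustration.
toy_attr <- c('gene_id "GENE0001"; gene_name "ABC1";',
              'gene_id "GENE0002";')

annot <- data.frame(
  gene_id   = gtf_attribute(toy_attr, "gene_id"),
  gene_name = gtf_attribute(toy_attr, "gene_name"),
  stringsAsFactors = FALSE
)

lookup_genes(c("GENE0002", "GENE0001"), annot)
```

In practice the attribute strings would come from the ninth column of the GTF, read in with something like read.delim().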

ADD REPLY • link 7.2 years ago by Kevin Blighe 88k

score 1 · Answer 1 · 2022-10-31

An update 5 years later:

This is now in biomaRt:

library(biomaRt)
ensembl <- useMart('ensembl', dataset = 'mfascicularis_gene_ensembl')
annot <- getBM(
  attributes = c(
    'external_gene_name',
    'ensembl_gene_id',
    'gene_biotype',
    'external_synonym'),
  mart = ensembl)

head(annot)

  external_gene_name    ensembl_gene_id   gene_biotype external_synonym
1             TBXAS1 ENSMFAG00000016846 protein_coding           CYP5A1
2              PTGIS ENSMFAG00000033233 protein_coding           CYP8A1
3              EIF3K ENSMFAG00000033962 protein_coding          EIF3S12
4              ADAM5 ENSMFAG00000044676 protein_coding          tMDC II
5              ADAM5 ENSMFAG00000044676 protein_coding            TMDC2
6             NDUFA2 ENSMFAG00000049634 protein_coding            CI-B8
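Because 'external_synonym' can hold several values per gene, a gene may appear on more than one row (ADAM5 above). If one row per gene is wanted, the synonyms could be collapsed with base R. This is only a sketch: `collapse_synonyms` is a hypothetical helper, and `toy_annot` copies the six rows shown above so that the example runs without querying Ensembl.

```r
# Hypothetical helper: one row per gene, with synonyms joined by "; ".
# Missing or empty synonyms are dropped before joining.
collapse_synonyms <- function(annot) {
  aggregate(
    external_synonym ~ external_gene_name + ensembl_gene_id + gene_biotype,
    data = annot,
    FUN = function(x) paste(unique(x[!is.na(x) & nzchar(x)]), collapse = "; "),
    na.action = na.pass
  )
}

# The six rows printed by head(annot) above.
toy_annot <- data.frame(
  external_gene_name = c("TBXAS1", "PTGIS", "EIF3K", "ADAM5", "ADAM5", "NDUFA2"),
  ensembl_gene_id = c("ENSMFAG00000016846", "ENSMFAG00000033233",
                      "ENSMFAG00000033962", "ENSMFAG00000044676",
                      "ENSMFAG00000044676", "ENSMFAG00000049634"),
  gene_biotype = "protein_coding",
  external_synonym = c("CYP5A1", "CYP8A1", "EIF3S12", "tMDC II", "TMDC2", "CI-B8"),
  stringsAsFactors = FALSE
)

collapsed <- collapse_synonyms(toy_annot)
collapsed
```

Rows whose gene name or ID is NA would be dropped by aggregate(), so those would need separate handling if they matter.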

Kind regards,

Kevin